Problem Statement¶
PROJECT 1
Exploratory analysis and predictive modeling of Barcelona housing prices using Python and SQL
Objective¶
Develop a complete analysis and a predictive model for housing prices in Barcelona, using data scraped from the Fotocasa portal. The goal is to apply data extraction, manipulation, and analysis techniques, as well as Machine Learning algorithms, to predict housing prices from a set of property features.
Data Description¶
- price: The price of the property.
- rooms: Number of rooms.
- bathroom: Number of bathrooms.
- lift: Whether the building has an elevator (lift) or not.
- terrace: Whether the unit has a terrace or not.
- square_meters: Size in square meters.
- real_state: Type of property.
- neighborhood: Neighborhood of the unit.
- square_meters_price: Price per square meter.
Importing necessary libraries¶
In [288]:
import pandas as pd
import numpy as np
# To help with data visualization
import matplotlib.pyplot as plt # data visualization
import seaborn as sns # data visualization
%matplotlib inline
sns.set_style('whitegrid') # set style for visualization
# To suppress warnings
import warnings # ignore warnings
warnings.filterwarnings('ignore')
from scipy.stats import zscore
#normalizing
from sklearn.preprocessing import MinMaxScaler, StandardScaler # to scale the data
# modeling
import statsmodels.api as sm # adding a constant to the independent variables
from sklearn.model_selection import train_test_split # splitting data in train and test sets
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# To be used for tuning the model
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# for validation
from sklearn.model_selection import cross_val_score, KFold
# Deploy
import joblib
Loading the Dataset¶
In [289]:
df=pd.read_csv('Barcelona_Fotocasa_HousingPrices.csv')
Data Overview¶
In [290]:
df.head() # preview the first 5 rows
Out[290]:
| | Unnamed: 0 | price | rooms | bathroom | lift | terrace | square_meters | real_state | neighborhood | square_meters_price |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 750 | 3 | 1 | True | False | 60 | flat | Horta- Guinardo | 12.500000 |
| 1 | 1 | 770 | 2 | 1 | True | False | 59 | flat | Sant Andreu | 13.050847 |
| 2 | 2 | 1300 | 1 | 1 | True | True | 30 | flat | Gràcia | 43.333333 |
| 3 | 3 | 2800 | 1 | 1 | True | True | 70 | flat | Ciutat Vella | 40.000000 |
| 4 | 4 | 720 | 2 | 1 | True | False | 44 | flat | Sant Andreu | 16.363636 |
In [291]:
df.tail() # preview the last 5 rows
Out[291]:
| | Unnamed: 0 | price | rooms | bathroom | lift | terrace | square_meters | real_state | neighborhood | square_meters_price |
|---|---|---|---|---|---|---|---|---|---|---|
| 8183 | 8183 | 1075 | 2 | 2 | False | False | 65 | flat | Gràcia | 16.538462 |
| 8184 | 8184 | 1500 | 3 | 2 | True | False | 110 | flat | Eixample | 13.636364 |
| 8185 | 8185 | 1500 | 2 | 2 | True | True | 90 | flat | Sarria-Sant Gervasi | 16.666667 |
| 8186 | 8186 | 1500 | 3 | 2 | True | False | 110 | flat | Eixample | 13.636364 |
| 8187 | 8187 | 1500 | 3 | 2 | True | False | 110 | flat | Eixample | 13.636364 |
In [292]:
df.sample(20) # preview a random sample of 20 rows
Out[292]:
| | Unnamed: 0 | price | rooms | bathroom | lift | terrace | square_meters | real_state | neighborhood | square_meters_price |
|---|---|---|---|---|---|---|---|---|---|---|
| 64 | 64 | 1290 | 2 | 1 | True | False | 70 | flat | Les Corts | 18.428571 |
| 5695 | 5695 | 821 | 3 | 1 | True | False | 70 | flat | Sant Andreu | 11.728571 |
| 2514 | 2514 | 826 | 2 | 1 | True | False | 58 | flat | Eixample | 14.241379 |
| 6274 | 6274 | 1195 | 2 | 1 | True | True | 70 | attic | Eixample | 17.071429 |
| 1462 | 1462 | 2250 | 3 | 2 | True | False | 170 | flat | Sarria-Sant Gervasi | 13.235294 |
| 7803 | 7803 | 750 | 0 | 1 | True | True | 43 | flat | Sarria-Sant Gervasi | 17.441860 |
| 6078 | 6078 | 736 | 1 | 1 | True | False | 46 | flat | Eixample | 16.000000 |
| 7936 | 7936 | 1000 | 1 | 1 | True | False | 95 | flat | Eixample | 10.526316 |
| 3911 | 3911 | 1200 | 2 | 1 | True | False | 70 | flat | Sant Martí | 17.142857 |
| 7016 | 7016 | 800 | 2 | 1 | True | False | 57 | apartment | Sants-Montjuïc | 14.035088 |
| 5061 | 5061 | 900 | 2 | 1 | False | False | 60 | flat | Eixample | 15.000000 |
| 1689 | 1689 | 960 | 2 | 2 | True | False | 65 | flat | Ciutat Vella | 14.769231 |
| 3045 | 3045 | 1100 | 4 | 1 | False | False | 80 | flat | Eixample | 13.750000 |
| 1819 | 1819 | 750 | 2 | 1 | False | False | 60 | flat | Gràcia | 12.500000 |
| 2643 | 2643 | 2500 | 3 | 2 | True | False | 185 | flat | Sarria-Sant Gervasi | 13.513514 |
| 4891 | 4891 | 950 | 3 | 1 | True | False | 70 | flat | Sarria-Sant Gervasi | 13.571429 |
| 5415 | 5415 | 900 | 2 | 1 | True | False | 70 | flat | Eixample | 12.857143 |
| 3761 | 3761 | 975 | 3 | 1 | True | False | 60 | flat | Eixample | 16.250000 |
| 2306 | 2306 | 1200 | 2 | 1 | True | True | 80 | flat | Eixample | 15.000000 |
| 6678 | 6678 | 1135 | 3 | 1 | True | False | 93 | flat | Eixample | 12.204301 |
- The variable 'Unnamed: 0' is a redundant index and should be dropped from the data
- The target variable for modeling is "price"
In [293]:
print("There are", df.shape[0], 'rows and', df.shape[1], "columns.") # number of observations and features
There are 8188 rows and 10 columns.
- There are 8188 rows and 10 columns.
In [294]:
df.dtypes # data types
Out[294]:
Unnamed: 0               int64
price                    int64
rooms                    int64
bathroom                 int64
lift                      bool
terrace                   bool
square_meters            int64
real_state              object
neighborhood            object
square_meters_price    float64
dtype: object
- Data types are consistent with the variable descriptions
In [295]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8188 entries, 0 to 8187
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Unnamed: 0           8188 non-null   int64
 1   price                8188 non-null   int64
 2   rooms                8188 non-null   int64
 3   bathroom             8188 non-null   int64
 4   lift                 8188 non-null   bool
 5   terrace              8188 non-null   bool
 6   square_meters        8188 non-null   int64
 7   real_state           7920 non-null   object
 8   neighborhood         8188 non-null   object
 9   square_meters_price  8188 non-null   float64
dtypes: bool(2), float64(1), int64(5), object(2)
memory usage: 527.9+ KB
- There are missing values (NaN) in 'real_state' (268 of 8188 rows); they will be replaced with "unknown"
In [296]:
df.describe(include="all").T # statistical summary of the data.
Out[296]:
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Unnamed: 0 | 8188.0 | NaN | NaN | NaN | 4093.5 | 2363.816335 | 0.0 | 2046.75 | 4093.5 | 6140.25 | 8187.0 |
| price | 8188.0 | NaN | NaN | NaN | 1444.092574 | 1125.886215 | 320.0 | 875.0 | 1100.0 | 1540.0 | 15000.0 |
| rooms | 8188.0 | NaN | NaN | NaN | 2.420738 | 1.138592 | 0.0 | 2.0 | 2.0 | 3.0 | 10.0 |
| bathroom | 8188.0 | NaN | NaN | NaN | 1.508793 | 0.732798 | 1.0 | 1.0 | 1.0 | 2.0 | 8.0 |
| lift | 8188 | 2 | True | 5710 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| terrace | 8188 | 2 | False | 6518 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| square_meters | 8188.0 | NaN | NaN | NaN | 84.610161 | 47.874028 | 10.0 | 56.0 | 73.0 | 95.0 | 679.0 |
| real_state | 7920 | 4 | flat | 6505 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| neighborhood | 8188 | 10 | Eixample | 2401 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| square_meters_price | 8188.0 | NaN | NaN | NaN | 17.739121 | 9.245241 | 4.910714 | 12.790698 | 15.306122 | 19.444444 | 186.666667 |
- Unit sizes range from 10 m2 to 679 m2, with a mean of 84.61 m2
- Unit prices range from 320 EUR to 15,000 EUR per month, with a mean of 1,444 EUR per month
- The price is assumed to be monthly rent, so it is treated as EUR per month
- Price per square meter ranges from 4.9 EUR/m2/month to 186 EUR/m2/month, with a mean of 17.7 EUR/m2/month
- There are units listed with zero rooms
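The zero-room listings deserve a quick look; a plausible guess is that many of them are studios ("study" in 'real_state'). A minimal sketch on a toy frame with the dataset's column names (the toy values are ours, not from the data):

```python
import pandas as pd

# Toy frame standing in for df (same column names as the dataset)
df = pd.DataFrame({
    "rooms": [0, 2, 0, 3, 0],
    "real_state": ["study", "flat", "study", "flat", "flat"],
})

# Inspect the listings that report zero rooms
zero_rooms = df.loc[df["rooms"] == 0]
print(len(zero_rooms))                                   # 3
print(zero_rooms["real_state"].value_counts().idxmax())  # 'study'
```

On the real data, `df.loc[df['rooms'] == 0, 'real_state'].value_counts()` would show which property types dominate among zero-room listings.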
In [297]:
# Uniques
df.nunique() # Checking for number of variations in the data
Out[297]:
Unnamed: 0             8188
price                   889
rooms                    10
bathroom                  8
lift                      2
terrace                   2
square_meters           278
real_state                4
neighborhood             10
square_meters_price    2921
dtype: int64
In [298]:
df.columns
Out[298]:
Index(['Unnamed: 0', 'price', 'rooms', 'bathroom', 'lift', 'terrace',
'square_meters', 'real_state', 'neighborhood', 'square_meters_price'],
dtype='object')
In [299]:
for i in ['rooms', 'bathroom', 'lift', 'terrace', 'real_state', 'neighborhood']: # Checking uniques
print (i,": ",df[i].unique())
rooms :  [ 3  2  1  4  5  0  7  6 10  9]
bathroom :  [1 2 3 4 7 5 6 8]
lift :  [ True False]
terrace :  [False  True]
real_state :  ['flat' 'attic' nan 'apartment' 'study']
neighborhood :  ['Horta- Guinardo' 'Sant Andreu' 'Gràcia' 'Ciutat Vella' 'Sarria-Sant Gervasi' 'Les Corts' 'Sant Martí' 'Eixample' 'Sants-Montjuïc' 'Nou Barris']
In [301]:
# Value counts per categorical column
cat_cols = df.select_dtypes(include=['category', 'object','bool']).columns.tolist()
for column in cat_cols:
print(df[column].value_counts())
print("-" * 50)
lift
True     5710
False    2478
Name: count, dtype: int64
--------------------------------------------------
terrace
False    6518
True     1670
Name: count, dtype: int64
--------------------------------------------------
real_state
flat         6505
apartment     991
attic         315
study         109
Name: count, dtype: int64
--------------------------------------------------
neighborhood
Eixample               2401
Sarria-Sant Gervasi    1396
Ciutat Vella           1365
Gràcia                  688
Sant Martí              617
Sants-Montjuïc          590
Les Corts               524
Horta- Guinardo         315
Sant Andreu             178
Nou Barris              114
Name: count, dtype: int64
--------------------------------------------------
- There are four property types, the most common being "flat"
- Most units do not have a terrace
- Most units have a lift
- The neighborhood with the largest unit count is "Eixample"
In [302]:
# Duplicates
df.duplicated().sum() # Checking for duplicate entries in the data
Out[302]:
np.int64(0)
- There are no duplicate observations
Consolidated notes on Data Overview¶
- The variable 'Unnamed: 0' is a redundant index and should be dropped from the data
- The target variable for modeling is "price"
- There are 8188 rows and 10 columns.
- Data types are consistent with the variable descriptions
- There are missing values (NaN) in 'real_state'; they will be replaced with "unknown"
- Unit sizes range from 10 m2 to 679 m2, with a mean of 84.61 m2
- Unit prices range from 320 EUR to 15,000 EUR per month, with a mean of 1,444 EUR per month
- The price is assumed to be monthly rent, so it is treated as EUR per month
- Price per square meter ranges from 4.9 EUR/m2/month to 186 EUR/m2/month, with a mean of 17.7 EUR/m2/month
- There are units listed with zero rooms
- There are four property types, the most common being "flat"
- Most units do not have a terrace
- Most units have a lift
- The neighborhood with the largest unit count is "Eixample"
- There are no duplicate observations
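The cleaning steps implied by these notes (dropping the redundant index column and filling the missing property types) can be sketched as follows; the function name `clean_overview` is ours, not from the notebook:

```python
import pandas as pd

def clean_overview(df: pd.DataFrame) -> pd.DataFrame:
    """Drop the redundant 'Unnamed: 0' column and fill missing 'real_state'."""
    out = df.drop(columns=["Unnamed: 0"], errors="ignore").copy()
    out["real_state"] = out["real_state"].fillna("unknown")
    return out

# Toy frame mimicking the dataset's quirks
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "price": [750, 770],
    "real_state": ["flat", None],
})
cleaned = clean_overview(toy)
print(cleaned.columns.tolist())        # ['price', 'real_state']
print(cleaned["real_state"].tolist())  # ['flat', 'unknown']
```

Working on a copy keeps the raw DataFrame intact for comparison later in the analysis.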
Exploratory Data Analysis (EDA)¶
EDA Functions¶
In [303]:
def univariate_numerical(data):
'''
Function to generate two plots for each numerical variable
Histplot for variable distribution
Boxplot for statistical summary
'''
# Select numerical columns
numerical_cols = data.select_dtypes(include=[np.number]).columns
# Determine the number of rows and columns
num_vars = len(numerical_cols)
num_cols = 4
num_rows = int(np.ceil(num_vars * 2 / num_cols))
# Create a figure with the specified size
fig, axes = plt.subplots(num_rows, num_cols, figsize=(5*num_cols, num_rows * 5))
# Flatten the axes array for easy iteration
axes = axes.flatten()
# Plot each variable with a histplot and a boxplot
for i, col in enumerate(numerical_cols):
mean_value = data[col].mean()
# Histplot with KDE
sns.histplot(data[col], kde=True, ax=axes[i*2])
axes[i*2].axvline(mean_value, color='r', linestyle='--')
axes[i*2].set_title(f'Distribution of {col}')
axes[i*2].text(mean_value, axes[i*2].get_ylim()[1]*0.8, f'Mean: {mean_value:.2f}', color='r', va='baseline', ha='left',rotation=90)
# Boxplot
sns.boxplot(y=data[col], ax=axes[i*2 + 1])
axes[i*2 + 1].axhline(mean_value, color='r', linestyle='--')
axes[i*2 + 1].set_title(f'Boxplot of {col}')
axes[i*2 + 1].text(axes[i*2 + 1].get_xlim()[1]*0.8, mean_value, f'mean: {mean_value:.2f}', color='r', va='baseline', ha='right')
# Hide any remaining empty subplots
for j in range(num_vars * 2, len(axes)):
fig.delaxes(axes[j])
# Adjust layout
plt.tight_layout()
plt.show()
In [304]:
def univariate_categorical(data):
'''
Function to generate countplot for each categorical variable
Labeled with count and percentage
'''
# List of categorical columns
categorical_columns = data.select_dtypes(include=['object', 'category']).columns.tolist()
# Number of columns in the grid
num_cols = 4
# Calculate the number of rows needed
num_rows = (len(categorical_columns) + num_cols - 1) // num_cols
# Create the grid
fig, axes = plt.subplots(num_rows, num_cols, figsize=(5*num_cols, num_rows * 5), constrained_layout=True)
axes = axes.flatten()
# Plot each countplot in the grid
for i, col in enumerate(categorical_columns):
ax = axes[i]
plot = sns.countplot(x=col, data=data, order=data[col].value_counts().index, ax=ax)
ax.set_title(f'Count of {col}')
# Add total count and percentage annotations
total = len(data)
for p in plot.patches:
height = p.get_height()
percentage = f'{(height / total * 100):.1f}%'
plot.text(x=p.get_x() + p.get_width() / 2,
y=height + 2,
s=f'{height:.0f}\n({percentage})',
ha='center')
# Limit x-axis labels to avoid overlap
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right')
# Remove any empty subplots
for j in range(i + 1, len(axes)):
fig.delaxes(axes[j])
# Show the plot
plt.show()
In [305]:
# Function to plot crosstab with labels
def plot_crosstab_bar_count(df, var_interest):
'''
Function to create a barplot of crosstab of the variable of interest vs each of the rest of categorical variables
Labeled with counts
'''
# Extract categorical columns excluding the variable of interest
cat_cols = df.select_dtypes(include=['category', 'object','bool']).columns.tolist()
cat_cols.remove(var_interest)
# Determine the grid size
num_vars = len(cat_cols)
num_cols = 3 # Number of columns in the grid
num_rows = (num_vars // num_cols) + int(num_vars % num_cols > 0)
# Create a grid of subplots
fig, axes = plt.subplots(num_rows, num_cols, figsize=(5*num_cols, num_rows * 5), constrained_layout=True)
axes = axes.flatten() # Flatten the axes array for easy iteration
for i, col in enumerate(cat_cols):
# Create a crosstab
crosstab = pd.crosstab(df[col], df[var_interest])
# Plot the crosstab as a bar plot
crosstab.plot(kind='bar', stacked=True, ax=axes[i])
# Annotate counts in the middle of each bar section
for bar in axes[i].patches:
height = bar.get_height()
if height > 0:
axes[i].annotate(f'{int(height)}',
(bar.get_x() + bar.get_width() / 2, bar.get_y() + height / 2),
ha='center', va='center', fontsize=10, color='black')
# Add total labels at the top of each bar
totals = crosstab.sum(axis=1)
for j, total in enumerate(totals):
axes[i].annotate(f'Total: {total}',
(j, total),
ha='center', va='bottom', weight='bold')
# Hide any remaining empty subplots
for j in range(i + 1, len(axes)):
fig.delaxes(axes[j])
plt.tight_layout()
plt.show()
# Usage
#plot_crosstab_bar_count(df, var_interest='var_interest')
In [306]:
def plot_crosstab_heat_perc(df, var_interest,df_name="DataFrame"):
'''
Function to create a heatmap of crosstab of the variable of interest vs each of the rest of catagorical variables
Labeled with counts, percentage by row, percentage by column
'''
# Extract categorical columns excluding the variable of interest
cat_cols = df.select_dtypes(include=['category', 'object']).columns.tolist()
cat_cols.remove(var_interest)
# Determine the grid size
num_vars = len(cat_cols)
num_cols = 3 # Number of columns in the grid
num_rows = (num_vars // num_cols) + int(num_vars % num_cols > 0)
# Create a grid of subplots
fig, axes = plt.subplots(num_rows, num_cols, figsize=(6*num_cols, num_rows * 6))
axes = axes.flatten() # Flatten the axes array for easy iteration
for i, col in enumerate(cat_cols):
# Create crosstabs
crosstab = pd.crosstab(df[col], df[var_interest])
crosstab_perc_row = crosstab.div(crosstab.sum(axis=1), axis=0) * 100
crosstab_perc_col = crosstab.div(crosstab.sum(axis=0), axis=1) * 100
# Combine counts with percentages
crosstab_combined = crosstab.astype(str) + "\n" + \
crosstab_perc_row.round(2).astype(str) + "%" + "\n" + \
crosstab_perc_col.round(2).astype(str) + "%"
# Plot the crosstab as a heatmap
sns.heatmap(crosstab, annot=crosstab_combined, fmt='', cmap='Blues', ax=axes[i], cbar=False, annot_kws={"size": 8})
axes[i].set_title(f'Crosstab of {col} and {var_interest} - {df_name}', fontsize=12)
# Hide any remaining empty subplots
for j in range(i + 1, len(axes)):
fig.delaxes(axes[j])
# Adjust layout to prevent label overlapping
plt.subplots_adjust(hspace=0.4, wspace=0.4) # Add more space between subplots
plt.tight_layout()
plt.show()
# Usage
#plot_crosstab_heat_perc(df, var_interest='var_interest')
In [307]:
def boxplot_by_group(df, group, var, outliers, df_name="DataFrame"):
'''
boxplot for a numerical variable of interest vs a categorical variable
with or without outliers
includes data mean and mean by category
'''
# Calculate the average for the variable
var_avg = df[var].mean()
# Calculate variable mean per group
var_means = df.groupby(group)[var].mean()
# Sort by means and get the sorted order
var_sorted = var_means.sort_values(ascending=False).index
# Reorder the DataFrame by the sorted group
df[group] = pd.Categorical(df[group], categories=var_sorted, ordered=True)
# Create the boxplot with the reordered sectors
ax = sns.boxplot(data=df, x=group, y=var, order=var_sorted, showfliers=outliers)
# Add horizontal line for average variable value
plt.axhline(var_avg, color='red', linestyle='--', label=f'Avg {var}: {var_avg:.2f}')
# Scatter plot for means
x_positions = range(len(var_means.sort_values(ascending=False)))
plt.scatter(x=x_positions, y=var_means.sort_values(ascending=False), color='red', label='Mean', zorder=5)
# Add labels to each red dot with the mean value
for i, mean in enumerate(var_means.sort_values(ascending=False)):
plt.text(i, mean, f'{mean:.2f}', color='red', ha='center', va='bottom')
# Rotate x-axis labels
plt.xticks(ticks=x_positions, labels=var_means.sort_values(ascending=False).index, rotation=90)
# Add a legend
plt.legend()
plt.xlabel('') # Remove x-axis title
# Add plot title with DataFrame name
plt.title(f'Boxplot of {var} by {group} - {df_name}')
# Adjust layout
plt.tight_layout()
# Display the plot
#plt.show()
Functions:
- univariate_numerical(data): Function to generate two plots for each numerical variable. Histplot for variable distribution. Boxplot for statistical summary
- univariate_categorical(data): Function to generate countplot for each categorical variable. Labeled with count and percentage
- plot_crosstab_bar_count(df, var_interest): Function to create a barplot of crosstab of the variable of interest vs each of the rest of categorical variables. Labeled with counts
- plot_crosstab_heat_perc(df, var_interest): Function to create a heatmap of crosstab of the variable of interest vs each of the rest of categorical variables. Labeled with counts, percentage by row, percentage by column
- boxplot_by_group(df, group, var, outliers): boxplot for a numerical variable of interest vs a categorical variable. With or without outliers. Includes data mean and mean by category
Univariate Analysis¶
In [308]:
univariate_numerical(df)
- The variable "Unnamed: 0" has a uniform distribution, as expected for an index
- The numerical variables are right-skewed (long right tails)
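The right skew can be quantified with pandas' skew(); on the real data this would be df.select_dtypes(include=np.number).skew(). A sketch on a toy price column built from the summary's min, quartiles, and max:

```python
import pandas as pd

# Toy price column with a long right tail, echoing the dataset's summary
prices = pd.Series([320, 875, 1100, 1540, 15000], name="price")
print(prices.skew() > 0)  # True: positive skewness means a right-skewed tail
```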
In [309]:
univariate_categorical(df);
In [310]:
df.loc[(df['real_state']=="flat")].describe().T
Out[310]:
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Unnamed: 0 | 6505.0 | 4022.892083 | 2319.413714 | 0.000000 | 2059.0 | 3989.0 | 5943.000000 | 8187.0 |
| price | 6505.0 | 1316.758032 | 913.617019 | 320.000000 | 865.0 | 1050.0 | 1380.000000 | 15000.0 |
| rooms | 6505.0 | 2.541430 | 1.091669 | 0.000000 | 2.0 | 3.0 | 3.000000 | 10.0 |
| bathroom | 6505.0 | 1.510530 | 0.720411 | 1.000000 | 1.0 | 1.0 | 2.000000 | 8.0 |
| square_meters | 6505.0 | 85.514527 | 45.535769 | 11.000000 | 60.0 | 75.0 | 95.000000 | 679.0 |
| square_meters_price | 6505.0 | 15.755932 | 5.405202 | 5.555556 | 12.5 | 14.5 | 17.692308 | 100.0 |
In [311]:
df.loc[(df['neighborhood']=="Eixample")].describe().T
Out[311]:
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Unnamed: 0 | 2401.0 | 4202.712620 | 2371.995010 | 12.000000 | 2057.0 | 4236.000000 | 6284.00 | 8187.0 |
| price | 2401.0 | 1592.147022 | 1150.448403 | 490.000000 | 1000.0 | 1225.000000 | 1700.00 | 15000.0 |
| rooms | 2401.0 | 2.563099 | 1.098994 | 0.000000 | 2.0 | 3.000000 | 3.00 | 9.0 |
| bathroom | 2401.0 | 1.583923 | 0.681096 | 1.000000 | 1.0 | 2.000000 | 2.00 | 8.0 |
| square_meters | 2401.0 | 87.946689 | 40.787208 | 20.000000 | 64.0 | 80.000000 | 100.00 | 350.0 |
| square_meters_price | 2401.0 | 19.018949 | 11.321671 | 6.741573 | 13.0 | 15.714286 | 20.63 | 144.5 |
- The categorical variables are not balanced: 79% of properties are "flats" and about 79% of units are concentrated in half of the ten neighbourhoods
- 75% of "flat" units have up to 3 bedrooms and up to 2 bathrooms, with an average size of 85.5 m2.
- 75% of the units in Eixample have up to 3 bedrooms and up to 2 bathrooms, with an average size of 87.9 m2.
Bivariate Analysis¶
In [312]:
# Create a PairGrid
g = sns.PairGrid(df, corner=True)
# Map different plots to the grid
g.map_lower(sns.scatterplot)
g.map_diag(sns.histplot,kde=True)
# Show the plot
plt.show()
In [313]:
# Calculate correlation matrix
corr_matrix = df.select_dtypes(include=np.number).corr()
In [314]:
# Plot correlation matrix as heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix Heatmap')
plt.show()
In [315]:
# Display the sorted correlation table
corr_unstacked = corr_matrix.unstack() # Unstack the correlation matrix
corr_unstacked = corr_unstacked.reset_index() # Reset the index to get 'variable1' and 'variable2' as columns
corr_unstacked.columns = ['variable1', 'variable2', 'correlation']# Rename the columns for better understanding
corr_unstacked = corr_unstacked[corr_unstacked['variable1'] != corr_unstacked['variable2']] # Remove self-correlations by filtering out rows where variable1 == variable2
corr_unstacked = corr_unstacked.drop_duplicates(subset=['correlation']) # Drop duplicates to keep only one entry per variable pair
sorted_corr = corr_unstacked.sort_values(by='correlation', key=abs, ascending=False) # Sort the DataFrame by the absolute value of correlation
#sorted_corr # Display the sorted correlation table
In [316]:
# Define a function to categorize the correlation level
def categorize_correlation(correlation):
abs_corr = abs(correlation) * 100 # Convert to percentage for easier comparison
if abs_corr < 30:
return 'Negligible'
elif 30 <= abs_corr < 50:
return 'Low'
elif 50 <= abs_corr < 70:
return 'Moderate'
elif 70 <= abs_corr < 90:
return 'High'
else:
return 'Very High'
In [317]:
# Apply the function to create the corr_lvl column
sorted_corr['corr_lvl'] = sorted_corr['correlation'].apply(categorize_correlation)
sorted_corr['corr_lvl'].value_counts()
Out[317]:
corr_lvl
Negligible    8
Moderate      5
High          1
Low           1
Name: count, dtype: int64
In [318]:
sorted_corr
Out[318]:
| | variable1 | variable2 | correlation | corr_lvl |
|---|---|---|---|---|
| 22 | bathroom | square_meters | 0.755291 | High |
| 10 | price | square_meters | 0.690398 | Moderate |
| 16 | rooms | square_meters | 0.644285 | Moderate |
| 9 | price | bathroom | 0.579747 | Moderate |
| 15 | rooms | bathroom | 0.572576 | Moderate |
| 11 | price | square_meters_price | 0.520627 | Moderate |
| 8 | price | rooms | 0.348086 | Low |
| 17 | rooms | square_meters_price | -0.223946 | Negligible |
| 29 | square_meters | square_meters_price | -0.128385 | Negligible |
| 5 | Unnamed: 0 | square_meters_price | 0.069195 | Negligible |
| 4 | Unnamed: 0 | square_meters | -0.055379 | Negligible |
| 2 | Unnamed: 0 | rooms | -0.042374 | Negligible |
| 3 | Unnamed: 0 | bathroom | -0.040113 | Negligible |
| 23 | bathroom | square_meters_price | -0.013050 | Negligible |
| 1 | Unnamed: 0 | price | -0.010116 | Negligible |
- 'square_meters' is positively correlated with 'price', 'rooms' and 'bathroom'
- 'square_meters_price' has weak negative correlations with 'rooms' and 'square_meters' (and a negligible one with 'bathroom')
- Only one pair of variables shows high correlation: 'bathroom' and 'square_meters' (0.755)
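Given the high bathroom/square_meters correlation, multicollinearity is worth checking before fitting linear models. A minimal sketch using statsmodels' variance_inflation_factor on toy data (the vif_table helper and the toy frame are ours, not from the notebook):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(df: pd.DataFrame) -> pd.Series:
    """VIF per numeric column; a constant column is added for the intercept."""
    X = df.select_dtypes(include=np.number).assign(const=1.0)
    vifs = {col: variance_inflation_factor(X.values, i)
            for i, col in enumerate(X.columns) if col != "const"}
    return pd.Series(vifs).sort_values(ascending=False)

# Toy data: 'a' and 'b' are strongly related, 'c' is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 0.9 * a + rng.normal(scale=0.3, size=200),
    "c": rng.normal(size=200),
})
print(vif_table(toy))  # 'a' and 'b' inflated, 'c' near 1
```

A common rule of thumb flags VIF values above 5-10 as problematic for linear regression coefficients.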
In [319]:
df.columns
Out[319]:
Index(['Unnamed: 0', 'price', 'rooms', 'bathroom', 'lift', 'terrace',
'square_meters', 'real_state', 'neighborhood', 'square_meters_price'],
dtype='object')
In [320]:
# check unique rooms-bathroom combinations
unique_combinations=df.groupby(['rooms', 'bathroom']).size().reset_index(name='count')
unique_combinations_sorted=unique_combinations.sort_values(by='count',ascending=False)
In [321]:
# Calculate cumulative sum of counts
unique_combinations_sorted['cum_sum'] = unique_combinations_sorted['count'].cumsum()
In [322]:
# Calculate the cumulative percentage
unique_combinations_sorted['perc'] = unique_combinations_sorted['count'] / unique_combinations_sorted['count'].sum() * 100
unique_combinations_sorted['cum_perc'] = unique_combinations_sorted['cum_sum'] / unique_combinations_sorted['count'].sum() * 100
unique_combinations_sorted.head(10)
Out[322]:
| | rooms | bathroom | count | cum_sum | perc | cum_perc |
|---|---|---|---|---|---|---|
| 6 | 2 | 1 | 1836 | 1836 | 22.423058 | 22.423058 |
| 4 | 1 | 1 | 1519 | 3355 | 18.551539 | 40.974597 |
| 11 | 3 | 2 | 1207 | 4562 | 14.741085 | 55.715681 |
| 10 | 3 | 1 | 1123 | 5685 | 13.715193 | 69.430874 |
| 7 | 2 | 2 | 735 | 6420 | 8.976551 | 78.407426 |
| 17 | 4 | 2 | 641 | 7061 | 7.828530 | 86.235955 |
| 16 | 4 | 1 | 191 | 7252 | 2.332682 | 88.568637 |
| 0 | 0 | 1 | 188 | 7440 | 2.296043 | 90.864680 |
| 18 | 4 | 3 | 173 | 7613 | 2.112848 | 92.977528 |
| 12 | 3 | 3 | 111 | 7724 | 1.355642 | 94.333170 |
- The most popular unit configuration in the dataset is 2 bedrooms and 1 bathroom, with 1836 units (22.42% of all units).
- Other popular configurations are 1-1 (18.55%), 3-2 (14.74%), 3-1 (13.72%), 2-2 (8.98%) and 4-2 (7.83%)
- These six most popular configurations account for 86.24% of all units
In [323]:
df_pop = df.merge(unique_combinations_sorted.head(10), on=['rooms', 'bathroom'])
df_pop
Out[323]:
| | Unnamed: 0 | price | rooms | bathroom | lift | terrace | square_meters | real_state | neighborhood | square_meters_price | count | cum_sum | perc | cum_perc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 750 | 3 | 1 | True | False | 60 | flat | Horta- Guinardo | 12.500000 | 1123 | 5685 | 13.715193 | 69.430874 |
| 1 | 1 | 770 | 2 | 1 | True | False | 59 | flat | Sant Andreu | 13.050847 | 1836 | 1836 | 22.423058 | 22.423058 |
| 2 | 2 | 1300 | 1 | 1 | True | True | 30 | flat | Gràcia | 43.333333 | 1519 | 3355 | 18.551539 | 40.974597 |
| 3 | 3 | 2800 | 1 | 1 | True | True | 70 | flat | Ciutat Vella | 40.000000 | 1519 | 3355 | 18.551539 | 40.974597 |
| 4 | 4 | 720 | 2 | 1 | True | False | 44 | flat | Sant Andreu | 16.363636 | 1836 | 1836 | 22.423058 | 22.423058 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7719 | 8183 | 1075 | 2 | 2 | False | False | 65 | flat | Gràcia | 16.538462 | 735 | 6420 | 8.976551 | 78.407426 |
| 7720 | 8184 | 1500 | 3 | 2 | True | False | 110 | flat | Eixample | 13.636364 | 1207 | 4562 | 14.741085 | 55.715681 |
| 7721 | 8185 | 1500 | 2 | 2 | True | True | 90 | flat | Sarria-Sant Gervasi | 16.666667 | 735 | 6420 | 8.976551 | 78.407426 |
| 7722 | 8186 | 1500 | 3 | 2 | True | False | 110 | flat | Eixample | 13.636364 | 1207 | 4562 | 14.741085 | 55.715681 |
| 7723 | 8187 | 1500 | 3 | 2 | True | False | 110 | flat | Eixample | 13.636364 | 1207 | 4562 | 14.741085 | 55.715681 |
7724 rows × 14 columns
In [324]:
print(df.shape,df_pop.shape)
(8188, 10) (7724, 14)
- The "df_pop" DataFrame keeps only the units with the ten most popular bedroom/bathroom configurations, representing 94% of the sample (7724 of 8188 rows)
In [325]:
# Define the function to create and display side-by-side boxplots
def side_by_side_boxplot(df1, df2, group, var, outliers, title1, title2):
fig, axes = plt.subplots(1, 2, figsize=(18, 6), sharey=True)
# First subplot for df1
plt.sca(axes[0])
boxplot_by_group(df1, group, var, outliers, title1)
# Second subplot for df2
plt.sca(axes[1])
boxplot_by_group(df2, group, var, outliers, title2)
# Show both plots after setup
plt.show()
# Usage
#side_by_side_boxplot(df, df_pop, 'neighborhood', 'price', True, "All units (show outliers)", "Popular units (show outliers)")
In [326]:
side_by_side_boxplot(df, df_pop, 'neighborhood', 'price', True, "All units (show outliers)", "Popular units (show outliers)")
In [327]:
side_by_side_boxplot(df, df_pop, 'neighborhood', 'price', False, "All units (without outliers)", "Popular units (without outliers)")
- Sarrià-Sant Gervasi, Les Corts, Eixample and Sant Martí are the most expensive neighbourhoods, with average prices above the dataset average.
- Sants-Montjuïc, Horta-Guinardó, Sant Andreu and Nou Barris are the cheapest neighbourhoods, with average prices below the dataset average.
In [328]:
side_by_side_boxplot(df, df_pop, 'neighborhood', 'square_meters_price', False, "All units (without outliers)", "Popular units (without outliers)")
- When comparing price per square metre, Ciutat Vella and Eixample are the most expensive neighbourhoods.
- In terms of unit size, however, Ciutat Vella has the second smallest average and Eixample the third.
In [329]:
side_by_side_boxplot(df, df_pop, 'neighborhood', 'square_meters', False, "All units (without outliers)", "Popular units (without outliers)")
- From a price-per-square-meter perspective, the most attractive neighborhood in this data could be Les Corts, with an average surface of 89.79 m2 (above the overall average of 78.67 m2) and an average price per square meter of 15.85 (below the overall average of 17.79)
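The comparison behind this note can be reproduced with a grouped mean over 'square_meters' and 'square_meters_price'; a sketch on a toy frame with the dataset's column names (the toy values are ours):

```python
import pandas as pd

# Toy frame standing in for df (same column names as the dataset)
df = pd.DataFrame({
    "neighborhood": ["Les Corts", "Les Corts", "Eixample", "Eixample"],
    "square_meters": [90, 89, 80, 85],
    "square_meters_price": [15.5, 16.0, 19.0, 20.0],
})

# Mean size and mean price per m2 by neighborhood, cheapest per m2 first
summary = (df.groupby("neighborhood")[["square_meters", "square_meters_price"]]
             .mean()
             .sort_values("square_meters_price"))
print(summary)
```

Comparing each neighborhood's row against the overall means (`df[["square_meters", "square_meters_price"]].mean()`) identifies "large but cheap per m2" candidates like Les Corts.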
In [330]:
side_by_side_boxplot(df, df_pop, 'real_state' , 'price', False, "All units (without outliers)", "Popular units (without outliers)")
In [331]:
side_by_side_boxplot(df, df_pop, 'real_state' , 'square_meters_price', False, "All units (without outliers)", "Popular units (without outliers)")
In [332]:
side_by_side_boxplot(df, df_pop, 'real_state' , 'square_meters', False, "All units (without outliers)", "Popular units (without outliers)")
- From a price-per-square-meter perspective, the most attractive unit type in this data could be the apartment, with an average surface of 80 m2 (above the overall average of 78.67 m2) and an average price per square meter of 15.76 (below the overall average of 17.79)
In [333]:
plot_crosstab_heat_perc(df, var_interest='real_state',df_name="All units")
In [334]:
plot_crosstab_heat_perc(df_pop, var_interest='real_state',df_name="Popular units")
- There are 1,777 flats in Eixample, the most popular unit-type and neighborhood combination: 79.68% of the units in Eixample are flats, and 28.9% of all flats are in Eixample.
- Les Corts has only 398 flats, far from the most popular combination: although 87.67% of the units in Les Corts are flats, only 6.47% of all flats are in Les Corts.
In [335]:
plot_crosstab_bar_count(df, var_interest='lift')
- Most unit types usually have a lift; for flats the proportion is 74.12%
In [336]:
plot_crosstab_bar_count(df, var_interest='terrace')
- Units with a terrace, on the other hand, are rare across all unit types
Consolidated notes on Exploratory Data Analysis¶
- univariate_numerical(data): Function to generate two plots for each numerical variable. Histplot for variable distribution. Boxplot for statistical summary
- univariate_categorical(data): Function to generate countplot for each categorical variable. Labeled with count and percentage
- plot_crosstab_bar_count(df, var_interest): Function to create a barplot of crosstab of the variable of interest vs each of the rest of categorical variables. Labeled with counts
- plot_crosstab_heat_perc(df, var_interest): Function to create a heatmap of crosstab of the variable of interest vs each of the rest of categorical variables. Labeled with counts, percentage by row, percentage by column
- boxplot_by_group(df, group, var, outliers): boxplot for a numerical variable of interest vs a categorical variable, with or without outliers. Includes the overall mean and the mean by category
- The variable "Unnamed: 0" has a uniform distribution
- The numerical variables are right-skewed
- The categorical variables are not balanced: 79% of properties are flats, and 78% of units are concentrated in 50% of the sampled neighbourhoods
- 75% of "flat" units have up to 3 bedrooms and up to 2 bathrooms with an average size of 85m2.
- 75% of the units in Eixample have up to 3 bedrooms and up to 2 bathrooms with an average size of 87m2.
- 'square_meters' has a positive correlation with 'price', 'rooms' and 'bathrooms'
- 'square_meters_price' has a negative correlation with 'square_meters', 'rooms' and 'bathrooms'
- Only one pair of variables shows high correlation: bathroom and square_meters (0.75)
- The most popular unit configuration in the dataset is 2 bedrooms and 1 bathroom with 1836 units (22.42% of all units).
- Other popular configurations are 1-1 (18.55%), 3-2 (14.74%), 3-1 (13.71%), 2-2 (8.97%) and 4-2 (7.82%)
- These six most popular unit configurations represent 86.23% of all units
- The "df_pop" data frame includes the most popular units in terms of bedroom/bathroom configuration, representing 94% of the samples
- Sarrià-Sant Gervasi, Les Corts, Eixample and San Martí are the most expensive neighbourhoods with average prices above the dataset average.
- Sants-Montjuïc, Horta-Guinardó, Sant Andreu and Nou Barris are the cheapest neighbourhoods with average prices below the dataset average.
- Comparing price per square metre, Ciutat Vella and Eixample are the most expensive neighbourhoods.
- Comparing surface area, Ciutat Vella has the second-smallest average and Eixample the third.
- From a price-per-square-meter perspective, the most attractive neighborhood in this data could be Les Corts: its average surface area of 89.79 m2 is above the overall average (78.67 m2), while its price per square meter of 15.85 is below the overall average (17.79)
- From a price-per-square-meter perspective, the most attractive unit type in this data could be the apartment: its average surface area of 80 m2 is above the overall average (78.67 m2), while its price per square meter of 15.76 is below the overall average (17.79)
- There are 1,777 flats in Eixample, making it the most popular unit-type and neighborhood combination: 79.68% of the units in Eixample are flats, and 28.9% of all flats are in Eixample.
- Les Corts has only 398 flats, far from the most popular combination: although 87.67% of the dwellings in Les Corts are flats, only 6.47% of all flats are in Les Corts.
- Most unit types have a lift; for flats the proportion is 74.12%
- Units with a terrace, on the other hand, are rare: very few units have one
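The "most attractive neighborhood" comparison in the notes above can be sketched with a small groupby; the values below are invented for illustration, not the Fotocasa sample:

```python
import pandas as pd

# Invented values for two neighborhoods (not the Fotocasa data)
toy = pd.DataFrame({
    "neighborhood": ["Les Corts", "Les Corts", "Eixample", "Eixample"],
    "square_meters": [90, 88, 70, 72],
    "square_meters_price": [15.0, 16.0, 19.0, 20.0],
})

means = toy.groupby("neighborhood")[["square_meters", "square_meters_price"]].mean()
# "Attractive": more space than average at a below-average price per m2
attractive = means[
    (means["square_meters"] > toy["square_meters"].mean())
    & (means["square_meters_price"] < toy["square_meters_price"].mean())
]
```

Here `attractive` keeps only "Les Corts", mirroring the argument made for the real data.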
Data Preprocessing¶
- Missing value treatment
- Feature engineering
- Outlier detection and treatment
- Any other preprocessing steps
In [337]:
df2 = df.copy() # Data preprocessing over a copy of the original dataset
- Preprocessed data on a copy of the original dataset named df2
In [338]:
df2.isna().sum() # missing values per feature
Out[338]:
Unnamed: 0 0 price 0 rooms 0 bathroom 0 lift 0 terrace 0 square_meters 0 real_state 268 neighborhood 0 square_meters_price 0 dtype: int64
In [339]:
df2['real_state'].value_counts(dropna=False)
Out[339]:
real_state flat 6505 apartment 991 attic 315 NaN 268 study 109 Name: count, dtype: int64
In [340]:
# Add 'unknown' to categories
df2['real_state'] = pd.Categorical(df2['real_state'])
df2['real_state'] = df2['real_state'].cat.add_categories("unknown")
# Replace NaN values with 'unknown'
df2['real_state'] = df2['real_state'].fillna("unknown")
- Created a new category "unknown" in the variable 'real_state' replacing NaN
In [341]:
df2.isna().sum() # missing values per feature
Out[341]:
Unnamed: 0 0 price 0 rooms 0 bathroom 0 lift 0 terrace 0 square_meters 0 real_state 0 neighborhood 0 square_meters_price 0 dtype: int64
In [342]:
df2['real_state'].value_counts()
Out[342]:
real_state flat 6505 apartment 991 attic 315 unknown 268 study 109 Name: count, dtype: int64
In [343]:
df2.head()
Out[343]:
| Unnamed: 0 | price | rooms | bathroom | lift | terrace | square_meters | real_state | neighborhood | square_meters_price | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 750 | 3 | 1 | True | False | 60 | flat | Horta- Guinardo | 12.500000 |
| 1 | 1 | 770 | 2 | 1 | True | False | 59 | flat | Sant Andreu | 13.050847 |
| 2 | 2 | 1300 | 1 | 1 | True | True | 30 | flat | Gràcia | 43.333333 |
| 3 | 3 | 2800 | 1 | 1 | True | True | 70 | flat | Ciutat Vella | 40.000000 |
| 4 | 4 | 720 | 2 | 1 | True | False | 44 | flat | Sant Andreu | 16.363636 |
In [344]:
df2.drop(['Unnamed: 0'], axis=1, inplace=True)
- Removed the variable "Unnamed: 0" which had no value for modeling
In [345]:
# function to check for outliers
def count_outliers(df):
    outlier_count = 0
    for column in df.select_dtypes(include=np.number).columns:
        q1, q3 = df[column].quantile(0.25), df[column].quantile(0.75)
        iqr = q3 - q1
        outliers = ((df[column] < q1 - 1.5 * iqr) | (df[column] > q3 + 1.5 * iqr)).sum()
        print(f'{column}: {outliers} outliers ({outliers/df.shape[0]*100:.2f}%)')
        outlier_count += outliers
    return outlier_count
In [346]:
count_outliers(df2)
price: 807 outliers (9.86%) rooms: 474 outliers (5.79%) bathroom: 174 outliers (2.13%) square_meters: 598 outliers (7.30%) square_meters_price: 589 outliers (7.19%)
Out[346]:
2642
In [347]:
df2.shape
Out[347]:
(8188, 9)
- There are outliers in all numerical variables. df2_shape:(8188, 9)
In [348]:
# Calculate z-scores for only numeric columns without creating dummies
outlier_mask = (np.abs(df2.select_dtypes(include=np.number).apply(zscore)) < 3).all(axis=1)
# Filter the DataFrame based on the outlier mask and retain the original column structure
df3 = df2[outlier_mask]
In [349]:
count_outliers(df3)
price: 672 outliers (8.68%) rooms: 318 outliers (4.11%) bathroom: 0 outliers (0.00%) square_meters: 438 outliers (5.66%) square_meters_price: 449 outliers (5.80%)
Out[349]:
1877
In [350]:
df3.shape
Out[350]:
(7742, 9)
- Applied the Z-score method, which removes outliers with more than 3 standard deviations. Some variables with a relevant percentage of outliers still remain. df3_shape:(7742, 9)
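A minimal sketch of the Z-score filter used above, on an invented price column. Note that with very few rows a z-score can never exceed 3 (the maximum possible |z| is roughly the square root of the sample size), so the toy sample needs a dozen values for the extreme one to be flagged:

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Invented prices: eleven ordinary values plus one extreme outlier
toy = pd.DataFrame({"price": list(range(800, 1001, 20)) + [10000]})

# Keep only rows within 3 standard deviations on every numeric column,
# mirroring the filter applied to df2 above; the 10000 row is dropped
mask = (np.abs(toy.apply(zscore)) < 3).all(axis=1)
filtered = toy[mask]
```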
In [351]:
df4 = df3.copy()
for column in df4.select_dtypes(include=np.number).columns:
    q1, q3 = df4[column].quantile(0.25), df4[column].quantile(0.75)
    iqr = q3 - q1
    df4[column] = np.clip(df4[column], q1 - 1.5 * iqr, q3 + 1.5 * iqr)
In [352]:
count_outliers(df4)
price: 0 outliers (0.00%) rooms: 0 outliers (0.00%) bathroom: 0 outliers (0.00%) square_meters: 0 outliers (0.00%) square_meters_price: 0 outliers (0.00%)
Out[352]:
0
In [353]:
df4.shape
Out[353]:
(7742, 9)
- Limited outliers to respective whisker boundaries. df4_shape:(7742, 9)
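The whisker clipping above can be illustrated on a toy series: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are capped at the nearest boundary rather than dropped.

```python
import numpy as np
import pandas as pd

# Toy series with one extreme value
s = pd.Series([10, 12, 11, 13, 12, 100])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
# Cap values at the whisker boundaries instead of removing rows;
# here 100 is capped at q3 + 1.5*iqr = 15.0
clipped = np.clip(s, q1 - 1.5 * iqr, q3 + 1.5 * iqr)
```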
In [406]:
# creating dummies
df5 = pd.get_dummies(df4, columns=['real_state','neighborhood'], drop_first=False)
In [407]:
df5.shape
Out[407]:
(7742, 22)
- Created dummy variables for variables 'real_state' and 'neighborhood' considering drop_first=False. df5_shape:(7742, 22)
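A quick sketch of what `drop_first` changes, on a toy column (not the housing data): with `drop_first=False` every category keeps its own dummy, while `drop_first=True` drops one dummy per variable to avoid perfect multicollinearity in linear models.

```python
import pandas as pd

# Toy categorical column with three categories
toy = pd.DataFrame({"real_state": ["flat", "apartment", "attic"]})

full = pd.get_dummies(toy, columns=["real_state"], drop_first=False)    # 3 dummy columns
reduced = pd.get_dummies(toy, columns=["real_state"], drop_first=True)  # 2 dummy columns
```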
In [411]:
df5.info()
<class 'pandas.core.frame.DataFrame'> Index: 7742 entries, 0 to 8187 Data columns (total 22 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 price 7742 non-null int64 1 rooms 7742 non-null float64 2 bathroom 7742 non-null int64 3 lift 7742 non-null int64 4 terrace 7742 non-null int64 5 square_meters 7742 non-null int64 6 square_meters_price 7742 non-null float64 7 real_state_apartment 7742 non-null int64 8 real_state_attic 7742 non-null int64 9 real_state_flat 7742 non-null int64 10 real_state_study 7742 non-null int64 11 real_state_unknown 7742 non-null int64 12 neighborhood_Sarria-Sant Gervasi 7742 non-null int64 13 neighborhood_Les Corts 7742 non-null int64 14 neighborhood_Eixample 7742 non-null int64 15 neighborhood_Sant Martí 7742 non-null int64 16 neighborhood_Ciutat Vella 7742 non-null int64 17 neighborhood_Gràcia 7742 non-null int64 18 neighborhood_Sants-Montjuïc 7742 non-null int64 19 neighborhood_Horta- Guinardo 7742 non-null int64 20 neighborhood_Sant Andreu 7742 non-null int64 21 neighborhood_Nou Barris 7742 non-null int64 dtypes: float64(2), int64(20) memory usage: 1.4 MB
In [412]:
# Convert boolean to numeric
cols = df5.select_dtypes(['bool'])
for i in cols.columns:
    df5[i] = df5[i].astype('int')
In [413]:
df5.head()
Out[413]:
| price | rooms | bathroom | lift | terrace | square_meters | square_meters_price | real_state_apartment | real_state_attic | real_state_flat | ... | neighborhood_Sarria-Sant Gervasi | neighborhood_Les Corts | neighborhood_Eixample | neighborhood_Sant Martí | neighborhood_Ciutat Vella | neighborhood_Gràcia | neighborhood_Sants-Montjuïc | neighborhood_Horta- Guinardo | neighborhood_Sant Andreu | neighborhood_Nou Barris | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 750 | 3.0 | 1 | 1 | 0 | 60 | 12.500000 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 770 | 2.0 | 1 | 1 | 0 | 59 | 13.050847 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 1300 | 1.0 | 1 | 1 | 1 | 30 | 28.384073 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 2225 | 1.0 | 1 | 1 | 1 | 70 | 28.384073 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 720 | 2.0 | 1 | 1 | 0 | 44 | 16.363636 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
5 rows × 22 columns
- Boolean variables were converted to numeric
Consolidated notes on Data Preprocessing¶
- Preprocessed data on a copy of the original dataset named df2
- Created a new category "unknown" in the variable 'real_state' replacing NaN
- Removed the variable "Unnamed: 0" which had no value for modeling
- There are outliers in all numerical variables. df2_shape:(8188, 9)
- Applied the Z-score method, which removes outliers with more than 3 standard deviations. Some variables with a relevant percentage of outliers still remain. df3_shape:(7742, 9)
- Limited outliers to respective whisker boundaries. df4_shape:(7742, 9)
- Created dummy variables for variables 'real_state' and 'neighborhood' considering drop_first=False. df5_shape:(7742, 22)
- Boolean variables were converted to numeric
EDA (pre-modeling)¶
In [414]:
df5.shape
Out[414]:
(7742, 22)
In [415]:
df5.info()
<class 'pandas.core.frame.DataFrame'> Index: 7742 entries, 0 to 8187 Data columns (total 22 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 price 7742 non-null int64 1 rooms 7742 non-null float64 2 bathroom 7742 non-null int64 3 lift 7742 non-null int64 4 terrace 7742 non-null int64 5 square_meters 7742 non-null int64 6 square_meters_price 7742 non-null float64 7 real_state_apartment 7742 non-null int64 8 real_state_attic 7742 non-null int64 9 real_state_flat 7742 non-null int64 10 real_state_study 7742 non-null int64 11 real_state_unknown 7742 non-null int64 12 neighborhood_Sarria-Sant Gervasi 7742 non-null int64 13 neighborhood_Les Corts 7742 non-null int64 14 neighborhood_Eixample 7742 non-null int64 15 neighborhood_Sant Martí 7742 non-null int64 16 neighborhood_Ciutat Vella 7742 non-null int64 17 neighborhood_Gràcia 7742 non-null int64 18 neighborhood_Sants-Montjuïc 7742 non-null int64 19 neighborhood_Horta- Guinardo 7742 non-null int64 20 neighborhood_Sant Andreu 7742 non-null int64 21 neighborhood_Nou Barris 7742 non-null int64 dtypes: float64(2), int64(20) memory usage: 1.4 MB
In [416]:
df5.describe().T
Out[416]:
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| price | 7742.0 | 1208.508525 | 467.396946 | 320.000000 | 850.000000 | 1080.00000 | 1400.0 | 2225.000000 |
| rooms | 7742.0 | 2.355464 | 1.014828 | 0.500000 | 2.000000 | 2.00000 | 3.0 | 4.500000 |
| bathroom | 7742.0 | 1.425084 | 0.573727 | 1.000000 | 1.000000 | 1.00000 | 2.0 | 3.000000 |
| lift | 7742.0 | 0.696461 | 0.459816 | 0.000000 | 0.000000 | 1.00000 | 1.0 | 1.000000 |
| terrace | 7742.0 | 0.193878 | 0.395360 | 0.000000 | 0.000000 | 0.00000 | 0.0 | 1.000000 |
| square_meters | 7742.0 | 77.151640 | 29.488338 | 10.000000 | 55.000000 | 71.00000 | 91.0 | 145.000000 |
| square_meters_price | 7742.0 | 16.445057 | 5.020367 | 5.555556 | 12.743952 | 15.09434 | 19.0 | 28.384073 |
| real_state_apartment | 7742.0 | 0.108887 | 0.311517 | 0.000000 | 0.000000 | 0.00000 | 0.0 | 1.000000 |
| real_state_attic | 7742.0 | 0.035133 | 0.184128 | 0.000000 | 0.000000 | 0.00000 | 0.0 | 1.000000 |
| real_state_flat | 7742.0 | 0.810385 | 0.392022 | 0.000000 | 1.000000 | 1.00000 | 1.0 | 1.000000 |
| real_state_study | 7742.0 | 0.014079 | 0.117825 | 0.000000 | 0.000000 | 0.00000 | 0.0 | 1.000000 |
| real_state_unknown | 7742.0 | 0.031516 | 0.174720 | 0.000000 | 0.000000 | 0.00000 | 0.0 | 1.000000 |
| neighborhood_Sarria-Sant Gervasi | 7742.0 | 0.161586 | 0.368095 | 0.000000 | 0.000000 | 0.00000 | 0.0 | 1.000000 |
| neighborhood_Les Corts | 7742.0 | 0.061224 | 0.239757 | 0.000000 | 0.000000 | 0.00000 | 0.0 | 1.000000 |
| neighborhood_Eixample | 7742.0 | 0.289848 | 0.453721 | 0.000000 | 0.000000 | 0.00000 | 1.0 | 1.000000 |
| neighborhood_Sant Martí | 7742.0 | 0.076466 | 0.265759 | 0.000000 | 0.000000 | 0.00000 | 0.0 | 1.000000 |
| neighborhood_Ciutat Vella | 7742.0 | 0.172307 | 0.377671 | 0.000000 | 0.000000 | 0.00000 | 0.0 | 1.000000 |
| neighborhood_Gràcia | 7742.0 | 0.086153 | 0.280609 | 0.000000 | 0.000000 | 0.00000 | 0.0 | 1.000000 |
| neighborhood_Sants-Montjuïc | 7742.0 | 0.074399 | 0.262437 | 0.000000 | 0.000000 | 0.00000 | 0.0 | 1.000000 |
| neighborhood_Horta- Guinardo | 7742.0 | 0.040300 | 0.196674 | 0.000000 | 0.000000 | 0.00000 | 0.0 | 1.000000 |
| neighborhood_Sant Andreu | 7742.0 | 0.022991 | 0.149886 | 0.000000 | 0.000000 | 0.00000 | 0.0 | 1.000000 |
| neighborhood_Nou Barris | 7742.0 | 0.014725 | 0.120457 | 0.000000 | 0.000000 | 0.00000 | 0.0 | 1.000000 |
- The data for modeling has shape (7742, 22); outliers were already treated during preprocessing
- The data for modeling have no missing values
In [417]:
univariate_numerical(df5)
- No outliers in the data
In [418]:
# Calculate correlation matrix
corr_matrix = df5.select_dtypes(include=np.number).corr()
In [419]:
# Plot correlation matrix as heatmap
plt.figure(figsize=(14, 12))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix Heatmap')
plt.show()
In [420]:
# Display the sorted correlation table
corr_unstacked = corr_matrix.unstack() # Unstack the correlation matrix
corr_unstacked = corr_unstacked.reset_index() # Reset the index to get 'variable1' and 'variable2' as columns
corr_unstacked.columns = ['variable1', 'variable2', 'correlation']# Rename the columns for better understanding
corr_unstacked = corr_unstacked[corr_unstacked['variable1'] != corr_unstacked['variable2']] # Remove self-correlations by filtering out rows where variable1 == variable2
corr_unstacked = corr_unstacked.drop_duplicates(subset=['correlation']) # Drop duplicates to keep only one entry per variable pair
sorted_corr = corr_unstacked.sort_values(by='correlation', key=abs, ascending=False) # Sort the DataFrame by the absolute value of correlation
sorted_corr # Display the sorted correlation table
Out[420]:
| variable1 | variable2 | correlation | |
|---|---|---|---|
| 163 | real_state_apartment | real_state_flat | -0.722653 |
| 49 | bathroom | square_meters | 0.706826 |
| 5 | price | square_meters | 0.693492 |
| 27 | rooms | square_meters | 0.681368 |
| 2 | price | bathroom | 0.596550 |
| ... | ... | ... | ... |
| 261 | real_state_unknown | neighborhood_Horta- Guinardo | 0.000627 |
| 238 | real_state_study | neighborhood_Sants-Montjuïc | -0.000458 |
| 260 | real_state_unknown | neighborhood_Sants-Montjuïc | -0.000432 |
| 217 | real_state_flat | neighborhood_Horta- Guinardo | 0.000268 |
| 210 | real_state_flat | neighborhood_Sarria-Sant Gervasi | 0.000187 |
231 rows × 3 columns
In [421]:
# Apply the function to create the corr_lvl column
sorted_corr['corr_lvl'] = sorted_corr['correlation'].apply(categorize_correlation)
sorted_corr['corr_lvl'].value_counts()
Out[421]:
corr_lvl Negligible 217 Low 8 Moderate 4 High 2 Name: count, dtype: int64
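The `categorize_correlation` helper is defined earlier in the notebook and not shown in this section. A plausible reconstruction, with thresholds assumed from common conventions and consistent with the labels above (|r| = 0.71 labeled High, 0.69 labeled Moderate), would be:

```python
def categorize_correlation(r):
    # Hypothetical reconstruction of the helper used above; the exact
    # thresholds are an assumption, chosen to match the labels shown
    r = abs(r)
    if r >= 0.9:
        return "Very High"
    if r >= 0.7:
        return "High"
    if r >= 0.5:
        return "Moderate"
    if r >= 0.3:
        return "Low"
    return "Negligible"
```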
In [422]:
sorted_corr
Out[422]:
| variable1 | variable2 | correlation | corr_lvl | |
|---|---|---|---|---|
| 163 | real_state_apartment | real_state_flat | -0.722653 | High |
| 49 | bathroom | square_meters | 0.706826 | High |
| 5 | price | square_meters | 0.693492 | Moderate |
| 27 | rooms | square_meters | 0.681368 | Moderate |
| 2 | price | bathroom | 0.596550 | Moderate |
| ... | ... | ... | ... | ... |
| 261 | real_state_unknown | neighborhood_Horta- Guinardo | 0.000627 | Negligible |
| 238 | real_state_study | neighborhood_Sants-Montjuïc | -0.000458 | Negligible |
| 260 | real_state_unknown | neighborhood_Sants-Montjuïc | -0.000432 | Negligible |
| 217 | real_state_flat | neighborhood_Horta- Guinardo | 0.000268 | Negligible |
| 210 | real_state_flat | neighborhood_Sarria-Sant Gervasi | 0.000187 | Negligible |
231 rows × 4 columns
- Correlations between variables are generally low; only one pair shows high correlation (bathroom and square_meters)
Consolidated notes on EDA (pre-modeling)¶
- The data for modeling has shape (7742, 22); outliers were already treated during preprocessing
- The data for modeling have no missing values
- No outliers in the data
- Correlations between variables are generally low; only one pair shows high correlation (bathroom and square_meters)
Modeling¶
Modeling Functions¶
In [423]:
# Define a function to evaluate and return the model's metrics
def evaluate_model(model, x_test, y_test):
    y_pred = model.predict(x_test)
    metrics = {
        "MAE": mean_absolute_error(y_test, y_pred),
        "MSE": mean_squared_error(y_test, y_pred),
        "RMSE": np.sqrt(mean_squared_error(y_test, y_pred)),
        "R2 Score": r2_score(y_test, y_pred)
    }
    return metrics
In [424]:
from sklearn.model_selection import KFold, cross_val_score  # cross-validation utilities used below

def models_regression_performance_cv_fit(models, x_train, y_train, x_test, y_test, res_df=None):
    """
    Function to compute different metrics to check regression model performance.
    models: list of (name, model) tuples to evaluate
    x_train, y_train: training set predictors and target
    x_test, y_test: validation set predictors and target
    res_df: dataframe to store performance metrics
    After cross-validation on the training data, each model is fit on the full
    training set and evaluated on the validation set.
    """
    # If res_df is not provided, create an empty dataframe
    if res_df is None:
        res_df = pd.DataFrame(columns=["Model", "MAE", "MSE", "RMSE", "R2 Score"])
    # Store new rows in a list
    results_list = []
    # Perform cross-validation on the training set
    for name, model in models:
        # K-Fold cross-validation
        kfold = KFold(n_splits=5, shuffle=True, random_state=1)
        # Cross-validation scores
        cv_mse = -cross_val_score(model, x_train, y_train, scoring="neg_mean_squared_error", cv=kfold)
        cv_mae = -cross_val_score(model, x_train, y_train, scoring="neg_mean_absolute_error", cv=kfold)
        cv_r2 = cross_val_score(model, x_train, y_train, scoring="r2", cv=kfold)
        # Calculate mean values for each metric
        train_mse = cv_mse.mean()
        train_mae = cv_mae.mean()
        train_r2 = cv_r2.mean()
        # Add the new row with the cross-validation metrics
        results_list.append({
            "Model": f"{name}_train",
            "MAE": train_mae,
            "MSE": train_mse,
            "RMSE": np.sqrt(train_mse),
            "R2 Score": train_r2,
        })
        # Fit model and predict on test data
        model.fit(x_train, y_train)
        y_pred = model.predict(x_test)
        # Calculate test metrics
        test_mae = mean_absolute_error(y_test, y_pred)
        test_mse = mean_squared_error(y_test, y_pred)
        test_r2 = r2_score(y_test, y_pred)
        # Add the new row with the validation metrics
        results_list.append({
            "Model": f"{name}_val",
            "MAE": test_mae,
            "MSE": test_mse,
            "RMSE": np.sqrt(test_mse),
            "R2 Score": test_r2,
        })
    # Concatenate the list of results to res_df
    res_df = pd.concat([res_df, pd.DataFrame(results_list)], ignore_index=True)
    return res_df
- Defined function "evaluate_model(model, x_test, y_test)", to evaluate and return the model's metrics into a results dataframe
- Defined function "models_regression_performance_cv_fit(models, x_train, y_train, x_test, y_test, res_df=None)", to compute different metrics to check regression model performance
Preparing data for modeling¶
In [425]:
data=df5.copy()
- Data preparation for modeling over a copy named "data"
In [426]:
# specifying the independent and dependent variables
X = data.drop(["price"], axis=1)
y = data["price"]
# Checking independent and dependent variable shapes
print("Shape of independent variables : ", X.shape)
print("Shape of target variable : ", y.shape)
Shape of independent variables :  (7742, 21) Shape of target variable :  (7742,)
In [427]:
X.head()
Out[427]:
| rooms | bathroom | lift | terrace | square_meters | square_meters_price | real_state_apartment | real_state_attic | real_state_flat | real_state_study | ... | neighborhood_Sarria-Sant Gervasi | neighborhood_Les Corts | neighborhood_Eixample | neighborhood_Sant Martí | neighborhood_Ciutat Vella | neighborhood_Gràcia | neighborhood_Sants-Montjuïc | neighborhood_Horta- Guinardo | neighborhood_Sant Andreu | neighborhood_Nou Barris | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.0 | 1 | 1 | 0 | 60 | 12.500000 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 2.0 | 1 | 1 | 0 | 59 | 13.050847 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 1.0 | 1 | 1 | 1 | 30 | 28.384073 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 1.0 | 1 | 1 | 1 | 70 | 28.384073 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2.0 | 1 | 1 | 0 | 44 | 16.363636 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
5 rows × 21 columns
In [428]:
# Apply Scaling to Input
input_scaler = MinMaxScaler()
X = pd.DataFrame(input_scaler.fit_transform(X), columns=X.columns)
In [429]:
# Apply Scaling to Output (target variable)
output_scaler = MinMaxScaler()
y = pd.DataFrame(output_scaler.fit_transform(y.values.reshape(-1, 1)), columns=["price"])
- Min-Max scaling was applied. Since the dataset's features are on different scales, normalization ensures that no single feature dominates the learning process.
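A small sketch of the scaling step, on toy prices rather than the dataset: MinMaxScaler maps each feature to [0, 1] via (x - min) / (max - min), and inverse_transform recovers the original units, which is how scaled predictions can be read back in euros.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy prices (invented), scaled to [0, 1] and then recovered
scaler = MinMaxScaler()
prices = np.array([[320.0], [1080.0], [2225.0]])
scaled = scaler.fit_transform(prices)        # (x - min) / (max - min)
restored = scaler.inverse_transform(scaled)  # back to the original units
```

Strictly speaking, the scaler would usually be fit on the training split only, so that test-set statistics do not leak into the transformation.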
In [430]:
# adding a constant to the independent variables
X = sm.add_constant(X)
In [431]:
X.head()
Out[431]:
| const | rooms | bathroom | lift | terrace | square_meters | square_meters_price | real_state_apartment | real_state_attic | real_state_flat | ... | neighborhood_Sarria-Sant Gervasi | neighborhood_Les Corts | neighborhood_Eixample | neighborhood_Sant Martí | neighborhood_Ciutat Vella | neighborhood_Gràcia | neighborhood_Sants-Montjuïc | neighborhood_Horta- Guinardo | neighborhood_Sant Andreu | neighborhood_Nou Barris | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.625 | 0.0 | 1.0 | 0.0 | 0.370370 | 0.304200 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | 1.0 | 0.375 | 0.0 | 1.0 | 0.0 | 0.362963 | 0.328330 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 1.0 | 0.125 | 0.0 | 1.0 | 1.0 | 0.148148 | 1.000000 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 1.0 | 0.125 | 0.0 | 1.0 | 1.0 | 0.444444 | 1.000000 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 1.0 | 0.375 | 0.0 | 1.0 | 0.0 | 0.251852 | 0.473446 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
5 rows × 22 columns
In [432]:
X['const'].value_counts()
Out[432]:
const 1.0 7742 Name: count, dtype: int64
- Added constant for modeling
In [433]:
# splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)
# Checking training and test sets.
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Shape of Training predictions : ", y_train.shape)
print("Shape of test predictions : ", y_test.shape)
Shape of Training set : (5419, 22) Shape of test set : (2323, 22) Shape of Training predictions : (5419, 1) Shape of test predictions : (2323, 1)
In [434]:
X_train.head()
Out[434]:
| const | rooms | bathroom | lift | terrace | square_meters | square_meters_price | real_state_apartment | real_state_attic | real_state_flat | ... | neighborhood_Sarria-Sant Gervasi | neighborhood_Les Corts | neighborhood_Eixample | neighborhood_Sant Martí | neighborhood_Ciutat Vella | neighborhood_Gràcia | neighborhood_Sants-Montjuïc | neighborhood_Horta- Guinardo | neighborhood_Sant Andreu | neighborhood_Nou Barris | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 280 | 1.0 | 1.000 | 1.0 | 1.0 | 0.0 | 1.000000 | 0.501322 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1026 | 1.0 | 0.875 | 1.0 | 1.0 | 0.0 | 1.000000 | 0.851761 | 0.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 557 | 1.0 | 0.125 | 0.0 | 1.0 | 0.0 | 0.259259 | 0.632737 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7525 | 1.0 | 0.875 | 0.0 | 0.0 | 0.0 | 0.481481 | 0.369908 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1255 | 1.0 | 0.875 | 0.5 | 1.0 | 1.0 | 0.666667 | 0.361147 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 22 columns
- Data split 70/30. Shape of Training set : (5419, 22), Shape of test set : (2323, 22)
Model Building¶
In [435]:
# Initialize an empty DataFrame to store results
results_df = pd.DataFrame(columns=["Model", "MAE", "MSE", "RMSE", "R2 Score"])
- Performance Metrics:
- MAE (Mean Absolute Error): Measures the average magnitude of errors in a set of predictions, without considering their direction.
- MSE (Mean Squared Error): Measures the average of the squares of the errors, giving more weight to larger errors.
- RMSE (Root Mean Squared Error): The square root of MSE, providing error in the same units as the target variable.
- R2 Score (Coefficient of Determination): Indicates how well the model's predictions approximate the real data points. A value closer to 1 indicates a better fit.
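A tiny worked example of the four metrics, using toy values where the only error is a single overshoot of 1:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy values: predictions are exact except for one error of +1
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.0, 4.0])

mae = mean_absolute_error(y_true, y_pred)   # mean(|error|)   = 1/3
mse = mean_squared_error(y_true, y_pred)    # mean(error^2)   = 1/3
rmse = np.sqrt(mse)                         # same units as the target
r2 = r2_score(y_true, y_pred)               # 1 - SSE/SST = 1 - 1/2 = 0.5
```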
In [436]:
# Dictionary of regression models to try
regression_models = {
    "Linear Regression": LinearRegression(),
    "Lasso Regression": Lasso(),
    "Ridge Regression": Ridge(),
    "Decision Tree": DecisionTreeRegressor(),
    "Random Forest": RandomForestRegressor(),
    "K-Nearest Neighbors": KNeighborsRegressor(),
    "Support Vector Regressor": SVR()
}
- Evaluated the following regression models:
- Linear Regression
- Lasso Regression
- Ridge Regression
- Decision Tree
- Random Forest
- K-Nearest Neighbors
- Support Vector Regressor
In [437]:
# Loop through each model, train it, evaluate it, and store results
for model_name, model in regression_models.items():
    model.fit(X_train, y_train)
    metrics = evaluate_model(model, X_test, y_test)
    metrics["Model"] = model_name  # Add model name for reference
    results_df = pd.concat([results_df, pd.DataFrame([metrics])], ignore_index=True)
In [438]:
# Display the results DataFrame
results_df.sort_values(by="R2 Score", ascending=False)
Out[438]:
| Model | MAE | MSE | RMSE | R2 Score | |
|---|---|---|---|---|---|
| 4 | Random Forest | 0.007321 | 0.000585 | 0.024185 | 0.990246 |
| 3 | Decision Tree | 0.010219 | 0.000944 | 0.030730 | 0.984253 |
| 6 | Support Vector Regressor | 0.043635 | 0.003353 | 0.057902 | 0.944096 |
| 0 | Linear Regression | 0.042146 | 0.004032 | 0.063495 | 0.932773 |
| 2 | Ridge Regression | 0.042285 | 0.004047 | 0.063619 | 0.932510 |
| 5 | K-Nearest Neighbors | 0.059559 | 0.009189 | 0.095859 | 0.846773 |
| 1 | Lasso Regression | 0.196899 | 0.059972 | 0.244891 | -0.000026 |
In [439]:
results_df.sort_values(by="MAE")
Out[439]:
| Model | MAE | MSE | RMSE | R2 Score | |
|---|---|---|---|---|---|
| 4 | Random Forest | 0.007321 | 0.000585 | 0.024185 | 0.990246 |
| 3 | Decision Tree | 0.010219 | 0.000944 | 0.030730 | 0.984253 |
| 0 | Linear Regression | 0.042146 | 0.004032 | 0.063495 | 0.932773 |
| 2 | Ridge Regression | 0.042285 | 0.004047 | 0.063619 | 0.932510 |
| 6 | Support Vector Regressor | 0.043635 | 0.003353 | 0.057902 | 0.944096 |
| 5 | K-Nearest Neighbors | 0.059559 | 0.009189 | 0.095859 | 0.846773 |
| 1 | Lasso Regression | 0.196899 | 0.059972 | 0.244891 | -0.000026 |
In [440]:
results_df.sort_values(by="MSE")
Out[440]:
| Model | MAE | MSE | RMSE | R2 Score | |
|---|---|---|---|---|---|
| 4 | Random Forest | 0.007321 | 0.000585 | 0.024185 | 0.990246 |
| 3 | Decision Tree | 0.010219 | 0.000944 | 0.030730 | 0.984253 |
| 6 | Support Vector Regressor | 0.043635 | 0.003353 | 0.057902 | 0.944096 |
| 0 | Linear Regression | 0.042146 | 0.004032 | 0.063495 | 0.932773 |
| 2 | Ridge Regression | 0.042285 | 0.004047 | 0.063619 | 0.932510 |
| 5 | K-Nearest Neighbors | 0.059559 | 0.009189 | 0.095859 | 0.846773 |
| 1 | Lasso Regression | 0.196899 | 0.059972 | 0.244891 | -0.000026 |
In [441]:
results_df.sort_values(by="RMSE")
Out[441]:
| Model | MAE | MSE | RMSE | R2 Score | |
|---|---|---|---|---|---|
| 4 | Random Forest | 0.007321 | 0.000585 | 0.024185 | 0.990246 |
| 3 | Decision Tree | 0.010219 | 0.000944 | 0.030730 | 0.984253 |
| 6 | Support Vector Regressor | 0.043635 | 0.003353 | 0.057902 | 0.944096 |
| 0 | Linear Regression | 0.042146 | 0.004032 | 0.063495 | 0.932773 |
| 2 | Ridge Regression | 0.042285 | 0.004047 | 0.063619 | 0.932510 |
| 5 | K-Nearest Neighbors | 0.059559 | 0.009189 | 0.095859 | 0.846773 |
| 1 | Lasso Regression | 0.196899 | 0.059972 | 0.244891 | -0.000026 |
- Random Forest: Best Performance. It has the lowest MAE (0.007321), MSE (0.000585), and RMSE (0.024185), and the highest R2 Score (0.990246), indicating it is the most accurate model among the ones tested.
- Decision Tree: Second Best. It also performs very well with low MAE, MSE, and RMSE, and a high R2 Score.
- Support Vector Regressor (SVR): Good Performance. It has a relatively low MAE, MSE, and RMSE, with a high R2 Score.
- Linear Regression and Ridge Regression: Similar Performance. Both have similar values for all metrics indicating decent performance.
- K-Nearest Neighbors (KNN): Moderate Performance. It has higher MAE, MSE, and RMSE, with a lower R2 Score, indicating it is less accurate compared to the top models.
- Lasso Regression: Poor Performance. It has the worst (highest) MAE, MSE, and RMSE, and a negative R2 Score, indicating it performs poorly on this dataset; the default regularization strength (alpha=1) is likely far too large for a target scaled to [0, 1], shrinking all coefficients to zero so the model predicts roughly the mean.
Model Tuning¶
In [442]:
%%time
# Define the model
rf = RandomForestRegressor()
# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}
# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
# Fit the model
grid_search.fit(X_train, y_train)
# Get the best parameters
best_params = grid_search.best_params_
print("Best parameters for Random Forest:", best_params)
Best parameters for Random Forest: {'max_depth': 20, 'min_samples_split': 5, 'n_estimators': 300}
CPU times: total: 5.67 s
Wall time: 3min 47s
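Besides `best_params_`, the fitted `GridSearchCV` object also exposes the refit best estimator and its cross-validated score, which can be reused directly instead of re-instantiating the model by hand. A minimal, self-contained sketch (synthetic data stands in for the notebook's `X_train`/`y_train`):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Illustrative data only; in the notebook this is the scaled housing data
rng = np.random.RandomState(0)
X_demo = rng.rand(100, 3)
y_demo = X_demo @ np.array([1.0, 2.0, 3.0]) + rng.normal(0, 0.05, 100)

grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [10, 20]},  # tiny grid for the sketch
    cv=3,
    scoring="neg_mean_squared_error",
)
grid.fit(X_demo, y_demo)

best_model = grid.best_estimator_  # already refit on the full training data
print(grid.best_params_)
print(grid.best_score_)            # mean CV score (negative MSE) of the best combination
```

Using `best_estimator_` avoids copying the tuned hyperparameters manually into a fresh model, as is done later in cell `In [445]`.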
In [443]:
%%time
# Define the model
dt = DecisionTreeRegressor()
# Define the parameter distribution
param_dist = {
'max_depth': [None, 10, 20, 30],
'min_samples_split': [2, 5, 10],
# Note: 'mse' and 'mae' were renamed 'squared_error' and 'absolute_error' in
# scikit-learn 1.0 and removed in later releases; use the new names on current versions.
'criterion': ['mse', 'friedman_mse', 'mae']
}
# Initialize RandomizedSearchCV
random_search = RandomizedSearchCV(estimator=dt, param_distributions=param_dist, n_iter=100, cv=5, scoring='neg_mean_squared_error', n_jobs=-1, random_state=42)
# Fit the model
random_search.fit(X_train, y_train)
# Get the best parameters
best_params = random_search.best_params_
print("Best parameters for Decision Tree:", best_params)
Best parameters for Decision Tree: {'min_samples_split': 5, 'max_depth': 20, 'criterion': 'friedman_mse'}
CPU times: total: 391 ms
Wall time: 1.14 s
In [444]:
%%time
# Define the model
svr = SVR()
# Define the parameter grid
param_grid = {
'C': [0.1, 1, 10, 100],
'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
'gamma': ['scale', 'auto']
}
# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=svr, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
# Fit the model
grid_search.fit(X_train, y_train)
# Get the best parameters
best_params = grid_search.best_params_
print("Best parameters for SVR:", best_params)
Best parameters for SVR: {'C': 100, 'gamma': 'scale', 'kernel': 'rbf'}
CPU times: total: 1.03 s
Wall time: 1min 20s
In [445]:
# Dictionary of regression models tuned
regression_models_tuned = {
"Tuned Random Forest": RandomForestRegressor(max_depth=20, min_samples_split=5, n_estimators=300),
"Tuned Decision Tree": DecisionTreeRegressor(min_samples_split=5, max_depth=20, criterion='friedman_mse'),
"Tuned Support Vector Regressor": SVR(C=100, gamma='scale', kernel='rbf')}
- The three best performing models were selected for model tuning: Random Forest, Decision Tree, and Support Vector Regression.
In [446]:
# Loop through each model, train it, evaluate it, and store results
for model_name, model in regression_models_tuned.items():
model.fit(X_train, y_train)
metrics = evaluate_model(model, X_test, y_test)
metrics["Model"] = model_name # Add model name for reference
results_df = pd.concat([results_df, pd.DataFrame([metrics])], ignore_index=True)
In [447]:
# Display the results DataFrame
results_df.sort_values(by="R2 Score", ascending=False)
Out[447]:
| Model | MAE | MSE | RMSE | R2 Score | |
|---|---|---|---|---|---|
| 7 | Tuned Random Forest | 0.007344 | 0.000561 | 0.023676 | 0.990653 |
| 4 | Random Forest | 0.007321 | 0.000585 | 0.024185 | 0.990246 |
| 8 | Tuned Decision Tree | 0.010478 | 0.000859 | 0.029309 | 0.985676 |
| 3 | Decision Tree | 0.010219 | 0.000944 | 0.030730 | 0.984253 |
| 9 | Tuned Support Vector Regressor | 0.040213 | 0.002837 | 0.053267 | 0.952687 |
| 6 | Support Vector Regressor | 0.043635 | 0.003353 | 0.057902 | 0.944096 |
| 0 | Linear Regression | 0.042146 | 0.004032 | 0.063495 | 0.932773 |
| 2 | Ridge Regression | 0.042285 | 0.004047 | 0.063619 | 0.932510 |
| 5 | K-Nearest Neighbors | 0.059559 | 0.009189 | 0.095859 | 0.846773 |
| 1 | Lasso Regression | 0.196899 | 0.059972 | 0.244891 | -0.000026 |
In [448]:
results_df.sort_values(by="MAE")
Out[448]:
| Model | MAE | MSE | RMSE | R2 Score | |
|---|---|---|---|---|---|
| 4 | Random Forest | 0.007321 | 0.000585 | 0.024185 | 0.990246 |
| 7 | Tuned Random Forest | 0.007344 | 0.000561 | 0.023676 | 0.990653 |
| 3 | Decision Tree | 0.010219 | 0.000944 | 0.030730 | 0.984253 |
| 8 | Tuned Decision Tree | 0.010478 | 0.000859 | 0.029309 | 0.985676 |
| 9 | Tuned Support Vector Regressor | 0.040213 | 0.002837 | 0.053267 | 0.952687 |
| 0 | Linear Regression | 0.042146 | 0.004032 | 0.063495 | 0.932773 |
| 2 | Ridge Regression | 0.042285 | 0.004047 | 0.063619 | 0.932510 |
| 6 | Support Vector Regressor | 0.043635 | 0.003353 | 0.057902 | 0.944096 |
| 5 | K-Nearest Neighbors | 0.059559 | 0.009189 | 0.095859 | 0.846773 |
| 1 | Lasso Regression | 0.196899 | 0.059972 | 0.244891 | -0.000026 |
In [449]:
results_df.sort_values(by="MSE")
Out[449]:
| Model | MAE | MSE | RMSE | R2 Score | |
|---|---|---|---|---|---|
| 7 | Tuned Random Forest | 0.007344 | 0.000561 | 0.023676 | 0.990653 |
| 4 | Random Forest | 0.007321 | 0.000585 | 0.024185 | 0.990246 |
| 8 | Tuned Decision Tree | 0.010478 | 0.000859 | 0.029309 | 0.985676 |
| 3 | Decision Tree | 0.010219 | 0.000944 | 0.030730 | 0.984253 |
| 9 | Tuned Support Vector Regressor | 0.040213 | 0.002837 | 0.053267 | 0.952687 |
| 6 | Support Vector Regressor | 0.043635 | 0.003353 | 0.057902 | 0.944096 |
| 0 | Linear Regression | 0.042146 | 0.004032 | 0.063495 | 0.932773 |
| 2 | Ridge Regression | 0.042285 | 0.004047 | 0.063619 | 0.932510 |
| 5 | K-Nearest Neighbors | 0.059559 | 0.009189 | 0.095859 | 0.846773 |
| 1 | Lasso Regression | 0.196899 | 0.059972 | 0.244891 | -0.000026 |
In [450]:
results_df.sort_values(by="RMSE")
Out[450]:
| Model | MAE | MSE | RMSE | R2 Score | |
|---|---|---|---|---|---|
| 7 | Tuned Random Forest | 0.007344 | 0.000561 | 0.023676 | 0.990653 |
| 4 | Random Forest | 0.007321 | 0.000585 | 0.024185 | 0.990246 |
| 8 | Tuned Decision Tree | 0.010478 | 0.000859 | 0.029309 | 0.985676 |
| 3 | Decision Tree | 0.010219 | 0.000944 | 0.030730 | 0.984253 |
| 9 | Tuned Support Vector Regressor | 0.040213 | 0.002837 | 0.053267 | 0.952687 |
| 6 | Support Vector Regressor | 0.043635 | 0.003353 | 0.057902 | 0.944096 |
| 0 | Linear Regression | 0.042146 | 0.004032 | 0.063495 | 0.932773 |
| 2 | Ridge Regression | 0.042285 | 0.004047 | 0.063619 | 0.932510 |
| 5 | K-Nearest Neighbors | 0.059559 | 0.009189 | 0.095859 | 0.846773 |
| 1 | Lasso Regression | 0.196899 | 0.059972 | 0.244891 | -0.000026 |
- After hyperparameter tuning, Tuned Random Forest, Random Forest, and Tuned Decision Tree are the best performing models.
- These three models will next be evaluated with cross-validation.
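The planned cross-validation step can be sketched with scikit-learn's `cross_val_score`; the data below is synthetic and stands in for the notebook's `X`/`y`:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Illustrative data only; the notebook applies this to the tuned models and the real data
rng = np.random.RandomState(1)
X_demo = rng.rand(120, 3)
y_demo = X_demo @ np.array([1.0, 2.0, 3.0]) + rng.normal(0, 0.05, 120)

model = RandomForestRegressor(max_depth=20, min_samples_split=5,
                              n_estimators=50, random_state=1)

# One R2 score per fold; mean +/- std summarizes out-of-fold performance
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")
print(scores.mean(), scores.std())
```

Reporting the fold-to-fold spread alongside the mean helps detect overfitting that a single train/test split can hide.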
Model Validation¶
In [451]:
# Splitting data into training, validation and test set (60-20-20)
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=1)
print(X_train.shape, X_val.shape, X_test.shape)
(4644, 22) (1549, 22) (1549, 22)
- 60-20-20 data split for training-validation-testing
In [455]:
# List of models to evaluate
top3models = [
("Tuned Random Forest_cv", RandomForestRegressor(max_depth=20, min_samples_split=5, n_estimators=300)),
("Random Forest_cv", RandomForestRegressor()),
("Tuned Decision Tree_cv", DecisionTreeRegressor(min_samples_split=5, max_depth=20, criterion='friedman_mse'))
]
In [456]:
res_df = pd.DataFrame(columns=["Model", "MAE", "MSE", "RMSE", "R2 Score"])
In [457]:
top3models = models_regression_performance_cv_fit(top3models, X_train, y_train, X_val, y_val, res_df)
In [458]:
top3models.sort_values(by="R2 Score", ascending=False)
Out[458]:
| Model | MAE | MSE | RMSE | R2 Score | |
|---|---|---|---|---|---|
| 0 | Tuned Random Forest_cv_train | 0.008073 | 0.000570 | 0.023871 | 0.990497 |
| 2 | Random Forest_cv_train | 0.008088 | 0.000589 | 0.024270 | 0.990342 |
| 1 | Tuned Random Forest_cv_val | 0.008124 | 0.000651 | 0.025522 | 0.989572 |
| 3 | Random Forest_cv_val | 0.008107 | 0.000652 | 0.025542 | 0.989556 |
| 4 | Tuned Decision Tree_cv_train | 0.012006 | 0.000972 | 0.031180 | 0.983627 |
| 5 | Tuned Decision Tree_cv_val | 0.011874 | 0.001130 | 0.033613 | 0.981912 |
- Tuned Random Forest is the best performing model among the cross-validated models
In [459]:
final_results_df = pd.concat([results_df, top3models], ignore_index=True)
final_results_df
Out[459]:
| Model | MAE | MSE | RMSE | R2 Score | |
|---|---|---|---|---|---|
| 0 | Linear Regression | 0.042146 | 0.004032 | 0.063495 | 0.932773 |
| 1 | Lasso Regression | 0.196899 | 0.059972 | 0.244891 | -0.000026 |
| 2 | Ridge Regression | 0.042285 | 0.004047 | 0.063619 | 0.932510 |
| 3 | Decision Tree | 0.010219 | 0.000944 | 0.030730 | 0.984253 |
| 4 | Random Forest | 0.007321 | 0.000585 | 0.024185 | 0.990246 |
| 5 | K-Nearest Neighbors | 0.059559 | 0.009189 | 0.095859 | 0.846773 |
| 6 | Support Vector Regressor | 0.043635 | 0.003353 | 0.057902 | 0.944096 |
| 7 | Tuned Random Forest | 0.007344 | 0.000561 | 0.023676 | 0.990653 |
| 8 | Tuned Decision Tree | 0.010478 | 0.000859 | 0.029309 | 0.985676 |
| 9 | Tuned Support Vector Regressor | 0.040213 | 0.002837 | 0.053267 | 0.952687 |
| 10 | Tuned Random Forest_cv_train | 0.008073 | 0.000570 | 0.023871 | 0.990497 |
| 11 | Tuned Random Forest_cv_val | 0.008124 | 0.000651 | 0.025522 | 0.989572 |
| 12 | Random Forest_cv_train | 0.008088 | 0.000589 | 0.024270 | 0.990342 |
| 13 | Random Forest_cv_val | 0.008107 | 0.000652 | 0.025542 | 0.989556 |
| 14 | Tuned Decision Tree_cv_train | 0.012006 | 0.000972 | 0.031180 | 0.983627 |
| 15 | Tuned Decision Tree_cv_val | 0.011874 | 0.001130 | 0.033613 | 0.981912 |
In [460]:
# Display the results DataFrame
final_results_df.sort_values(by="R2 Score", ascending=False)
Out[460]:
| Model | MAE | MSE | RMSE | R2 Score | |
|---|---|---|---|---|---|
| 7 | Tuned Random Forest | 0.007344 | 0.000561 | 0.023676 | 0.990653 |
| 10 | Tuned Random Forest_cv_train | 0.008073 | 0.000570 | 0.023871 | 0.990497 |
| 12 | Random Forest_cv_train | 0.008088 | 0.000589 | 0.024270 | 0.990342 |
| 4 | Random Forest | 0.007321 | 0.000585 | 0.024185 | 0.990246 |
| 11 | Tuned Random Forest_cv_val | 0.008124 | 0.000651 | 0.025522 | 0.989572 |
| 13 | Random Forest_cv_val | 0.008107 | 0.000652 | 0.025542 | 0.989556 |
| 8 | Tuned Decision Tree | 0.010478 | 0.000859 | 0.029309 | 0.985676 |
| 3 | Decision Tree | 0.010219 | 0.000944 | 0.030730 | 0.984253 |
| 14 | Tuned Decision Tree_cv_train | 0.012006 | 0.000972 | 0.031180 | 0.983627 |
| 15 | Tuned Decision Tree_cv_val | 0.011874 | 0.001130 | 0.033613 | 0.981912 |
| 9 | Tuned Support Vector Regressor | 0.040213 | 0.002837 | 0.053267 | 0.952687 |
| 6 | Support Vector Regressor | 0.043635 | 0.003353 | 0.057902 | 0.944096 |
| 0 | Linear Regression | 0.042146 | 0.004032 | 0.063495 | 0.932773 |
| 2 | Ridge Regression | 0.042285 | 0.004047 | 0.063619 | 0.932510 |
| 5 | K-Nearest Neighbors | 0.059559 | 0.009189 | 0.095859 | 0.846773 |
| 1 | Lasso Regression | 0.196899 | 0.059972 | 0.244891 | -0.000026 |
In [461]:
final_results_df.sort_values(by="MAE")
Out[461]:
| Model | MAE | MSE | RMSE | R2 Score | |
|---|---|---|---|---|---|
| 4 | Random Forest | 0.007321 | 0.000585 | 0.024185 | 0.990246 |
| 7 | Tuned Random Forest | 0.007344 | 0.000561 | 0.023676 | 0.990653 |
| 10 | Tuned Random Forest_cv_train | 0.008073 | 0.000570 | 0.023871 | 0.990497 |
| 12 | Random Forest_cv_train | 0.008088 | 0.000589 | 0.024270 | 0.990342 |
| 13 | Random Forest_cv_val | 0.008107 | 0.000652 | 0.025542 | 0.989556 |
| 11 | Tuned Random Forest_cv_val | 0.008124 | 0.000651 | 0.025522 | 0.989572 |
| 3 | Decision Tree | 0.010219 | 0.000944 | 0.030730 | 0.984253 |
| 8 | Tuned Decision Tree | 0.010478 | 0.000859 | 0.029309 | 0.985676 |
| 15 | Tuned Decision Tree_cv_val | 0.011874 | 0.001130 | 0.033613 | 0.981912 |
| 14 | Tuned Decision Tree_cv_train | 0.012006 | 0.000972 | 0.031180 | 0.983627 |
| 9 | Tuned Support Vector Regressor | 0.040213 | 0.002837 | 0.053267 | 0.952687 |
| 0 | Linear Regression | 0.042146 | 0.004032 | 0.063495 | 0.932773 |
| 2 | Ridge Regression | 0.042285 | 0.004047 | 0.063619 | 0.932510 |
| 6 | Support Vector Regressor | 0.043635 | 0.003353 | 0.057902 | 0.944096 |
| 5 | K-Nearest Neighbors | 0.059559 | 0.009189 | 0.095859 | 0.846773 |
| 1 | Lasso Regression | 0.196899 | 0.059972 | 0.244891 | -0.000026 |
In [462]:
final_results_df.sort_values(by="MSE")
Out[462]:
| Model | MAE | MSE | RMSE | R2 Score | |
|---|---|---|---|---|---|
| 7 | Tuned Random Forest | 0.007344 | 0.000561 | 0.023676 | 0.990653 |
| 10 | Tuned Random Forest_cv_train | 0.008073 | 0.000570 | 0.023871 | 0.990497 |
| 4 | Random Forest | 0.007321 | 0.000585 | 0.024185 | 0.990246 |
| 12 | Random Forest_cv_train | 0.008088 | 0.000589 | 0.024270 | 0.990342 |
| 11 | Tuned Random Forest_cv_val | 0.008124 | 0.000651 | 0.025522 | 0.989572 |
| 13 | Random Forest_cv_val | 0.008107 | 0.000652 | 0.025542 | 0.989556 |
| 8 | Tuned Decision Tree | 0.010478 | 0.000859 | 0.029309 | 0.985676 |
| 3 | Decision Tree | 0.010219 | 0.000944 | 0.030730 | 0.984253 |
| 14 | Tuned Decision Tree_cv_train | 0.012006 | 0.000972 | 0.031180 | 0.983627 |
| 15 | Tuned Decision Tree_cv_val | 0.011874 | 0.001130 | 0.033613 | 0.981912 |
| 9 | Tuned Support Vector Regressor | 0.040213 | 0.002837 | 0.053267 | 0.952687 |
| 6 | Support Vector Regressor | 0.043635 | 0.003353 | 0.057902 | 0.944096 |
| 0 | Linear Regression | 0.042146 | 0.004032 | 0.063495 | 0.932773 |
| 2 | Ridge Regression | 0.042285 | 0.004047 | 0.063619 | 0.932510 |
| 5 | K-Nearest Neighbors | 0.059559 | 0.009189 | 0.095859 | 0.846773 |
| 1 | Lasso Regression | 0.196899 | 0.059972 | 0.244891 | -0.000026 |
In [463]:
final_results_df.sort_values(by="RMSE")
Out[463]:
| Model | MAE | MSE | RMSE | R2 Score | |
|---|---|---|---|---|---|
| 7 | Tuned Random Forest | 0.007344 | 0.000561 | 0.023676 | 0.990653 |
| 10 | Tuned Random Forest_cv_train | 0.008073 | 0.000570 | 0.023871 | 0.990497 |
| 4 | Random Forest | 0.007321 | 0.000585 | 0.024185 | 0.990246 |
| 12 | Random Forest_cv_train | 0.008088 | 0.000589 | 0.024270 | 0.990342 |
| 11 | Tuned Random Forest_cv_val | 0.008124 | 0.000651 | 0.025522 | 0.989572 |
| 13 | Random Forest_cv_val | 0.008107 | 0.000652 | 0.025542 | 0.989556 |
| 8 | Tuned Decision Tree | 0.010478 | 0.000859 | 0.029309 | 0.985676 |
| 3 | Decision Tree | 0.010219 | 0.000944 | 0.030730 | 0.984253 |
| 14 | Tuned Decision Tree_cv_train | 0.012006 | 0.000972 | 0.031180 | 0.983627 |
| 15 | Tuned Decision Tree_cv_val | 0.011874 | 0.001130 | 0.033613 | 0.981912 |
| 9 | Tuned Support Vector Regressor | 0.040213 | 0.002837 | 0.053267 | 0.952687 |
| 6 | Support Vector Regressor | 0.043635 | 0.003353 | 0.057902 | 0.944096 |
| 0 | Linear Regression | 0.042146 | 0.004032 | 0.063495 | 0.932773 |
| 2 | Ridge Regression | 0.042285 | 0.004047 | 0.063619 | 0.932510 |
| 5 | K-Nearest Neighbors | 0.059559 | 0.009189 | 0.095859 | 0.846773 |
| 1 | Lasso Regression | 0.196899 | 0.059972 | 0.244891 | -0.000026 |
- From all models evaluated, Tuned Random Forest turns out to be the best performing model
Consolidated notes on Modeling¶
- Defined the function "evaluate_model(model, x_test, y_test)" to evaluate a model and return its metrics for the results dataframe
- Defined the function "models_regression_performance_cv_fit(models, x_train, y_train, x_test, y_test, res_df=None)" to compute several regression performance metrics on the training and validation sets
- Data preparation for modeling was done on a copy named "data"
- Min-Max scaling was applied. Because the dataset has features on different scales, normalization ensures that no single feature dominates the learning process.
- Added constant for modeling
- Data split 70/30. Shape of Training set : (5419, 22), Shape of test set : (2323, 22)
- Performance Metrics:
- MAE (Mean Absolute Error): Measures the average magnitude of errors in a set of predictions, without considering their direction.
- MSE (Mean Squared Error): Measures the average of the squares of the errors, giving more weight to larger errors.
- RMSE (Root Mean Squared Error): The square root of MSE, providing error in the same units as the target variable.
- R2 Score (Coefficient of Determination): Indicates how well the model's predictions approximate the real data points. A value closer to 1 indicates a better fit.
- Evaluated the following regression models:
- Linear Regression
- Lasso Regression
- Ridge Regression
- Decision Tree
- Random Forest
- K-Nearest Neighbors
- Support Vector Regressor
- Random Forest: Best Performance. It has the lowest MAE (0.007321), MSE (0.000585), and RMSE (0.024185), and the highest R2 Score (0.990246), indicating it is the most accurate model among the ones tested.
- Decision Tree: Second Best. It also performs very well with low MAE, MSE, and RMSE, and a high R2 Score.
- Support Vector Regressor (SVR): Good Performance. It has a relatively low MAE, MSE, and RMSE, with a high R2 Score.
- Linear Regression and Ridge Regression: Similar Performance. Both have nearly identical values for all metrics, indicating decent performance.
- K-Nearest Neighbors (KNN): Moderate Performance. It has higher MAE, MSE, and RMSE and a lower R2 Score, making it less accurate than the top models.
- Lasso Regression: Poor Performance. It has the worst (highest) MAE, MSE, and RMSE and a near-zero negative R2 Score, meaning it does no better than predicting the mean on this dataset.
- The three best performing models were selected for model tuning: Random Forest, Decision Tree, and Support Vector Regression.
- After hyperparameter tuning, Tuned Random Forest, Random Forest, and Tuned Decision Tree are the best performing models.
- The three best performing models after hyperparameter tuning were evaluated with cross-validation.
- 60-20-20 data split for training-validation-testing
- Tuned Random Forest is the best performing model among the cross-validated models
- From all models evaluated, Tuned Random Forest turns out to be the best performing model
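The `evaluate_model` helper referenced throughout these notes is defined in an earlier section of the notebook; it could be implemented along these lines (a sketch only — the notebook's actual version may differ):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate_model(model, x_test, y_test):
    """Return MAE, MSE, RMSE, and R2 Score for a fitted regressor."""
    y_pred = model.predict(x_test)
    mse = mean_squared_error(y_test, y_pred)
    return {
        "MAE": mean_absolute_error(y_test, y_pred),
        "MSE": mse,
        "RMSE": np.sqrt(mse),  # same units as the (scaled) target
        "R2 Score": r2_score(y_test, y_pred),
    }

# Quick smoke test on synthetic, noiseless linear data
rng = np.random.RandomState(2)
X_demo = rng.rand(50, 2)
y_demo = X_demo @ np.array([1.0, -1.0])
lr = LinearRegression().fit(X_demo, y_demo)
metrics = evaluate_model(lr, X_demo, y_demo)
print(metrics)
```

Returning a plain dict makes it easy to append each model's row to the results DataFrame with `pd.concat`, as done in cell `In [446]`.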
Final Model¶
In [464]:
# Define the model with the selected hyperparameters
RandomForestTuned = RandomForestRegressor(max_depth=20, min_samples_split=5, n_estimators=300)
# Train the model on the entire training dataset
RandomForestTuned.fit(X_train, y_train)
# Evaluate the model on the test set
y_pred = RandomForestTuned.predict(X_test)
In [465]:
evaluate_model(RandomForestTuned, X_test, y_test)
Out[465]:
{'MAE': np.float64(0.007490556570714266),
'MSE': np.float64(0.0005540728556713073),
'RMSE': np.float64(0.023538752211434387),
'R2 Score': 0.9905977292708693}
In [466]:
# Feature importance
feature_importances = pd.Series(RandomForestTuned.feature_importances_, index=X_train.columns)
feature_importances = feature_importances.sort_values(ascending=False)
# Plotting
plt.figure(figsize=(10, 6))
feature_importances.plot(kind='bar')
plt.title('Feature Importance')
plt.xlabel('Features')
plt.ylabel('Importance Score')
plt.show()
In [467]:
# Scatter plot of actual vs. predicted values, with the identity line y = x
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, color='blue')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color='red')  # perfect-prediction line
plt.xlabel('Actual values')
plt.ylabel('Predicted values')
plt.show()
Conclusions¶
- square_meters is the most important variable for predicting the price, so at least an approximate value should be provided as input.
- square_meters_price is the second most important variable; since it may be unknown at prediction time, it could be imputed with a mean (for example, by neighborhood).
- The most accurate model identified for predicting price is RandomForestRegressor(max_depth=20, min_samples_split=5, n_estimators=300).
- With this model, the performance achieved is MAE: 0.00749, MSE: 0.000554, RMSE: 0.023538, and R2 Score: 0.990597.
- These results were obtained with scaled data (Min-Max scaling), so the errors are in scaled units rather than euros.
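Because the model works on Min-Max-scaled targets, scaled predictions must be mapped back to euros before interpretation. A sketch using `inverse_transform`; the `output_scaler` below is a stand-in fitted on a few illustrative prices, whereas the notebook's scaler was fitted on the real `price` column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Stand-in prices; in the notebook the scaler is fitted on the full price column
prices = np.array([[120000.0], [250000.0], [890000.0], [1500000.0]])
output_scaler = MinMaxScaler().fit(prices)

# Scaled predictions (as the model would emit them), mapped back to euros
y_pred_scaled = np.array([[0.0], [0.5], [1.0]])
y_pred_eur = output_scaler.inverse_transform(y_pred_scaled)
print(y_pred_eur.ravel())
```

The same inverse transform would convert the scaled MAE/RMSE figures above into interpretable euro-denominated errors.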
Deploy¶
In [470]:
# Save model and scaler (joblib is imported here, as it is not loaded in the setup cell above)
import joblib
joblib.dump(RandomForestTuned, "Random_Forest_Tuned.pkl")
Out[470]:
['Random_Forest_Tuned.pkl']
In [471]:
joblib.dump(input_scaler, "Input_scaler.pkl")
Out[471]:
['Input_scaler.pkl']
In [472]:
joblib.dump(output_scaler, "Output_scaler.pkl")
Out[472]:
['Output_scaler.pkl']
- The joblib library is used here to save Python objects such as machine learning models, scalers, and other data-processing objects to disk.
- It allows these objects to be stored efficiently and reloaded later, avoiding the need to retrain models or refit scalers.
- joblib is commonly used for saving trained models because it is optimized for large amounts of data, especially NumPy arrays and scikit-learn objects.
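For inference, the saved artifacts are reloaded with `joblib.load`. A self-contained round-trip sketch (a throwaway model and a temporary directory stand in for the notebook's tuned model and file paths):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Throwaway model on synthetic data, for illustration only
rng = np.random.RandomState(3)
X_demo, y_demo = rng.rand(40, 2), rng.rand(40)
model = RandomForestRegressor(n_estimators=10, random_state=3).fit(X_demo, y_demo)

# Dump to a temp directory and reload; predictions must match exactly
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "Random_Forest_Tuned.pkl")
    joblib.dump(model, path)
    reloaded = joblib.load(path)
    same = np.allclose(model.predict(X_demo), reloaded.predict(X_demo))
print(same)  # True
```

In a deployment script, the input scaler, the model, and the output scaler would be loaded in the same way and applied in sequence: scale inputs, predict, then inverse-transform the prediction.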
In [473]:
!jupyter nbconvert --to html PROJECT1_TEAM2_CM.ipynb
C:\Users\otroc\AppData\Local\Programs\Python\Python313\Scripts\jupyter-nbconvert.EXE\__main__.py:4: DeprecationWarning: Parsing dates involving a day of month without a year specified is ambiguious and fails to parse leap day. The default behavior will change in Python 3.15 to either always raise an exception or to use a different default year (TBD). To avoid trouble, add a specific year to the input & format. See https://github.com/python/cpython/issues/70647.
[NbConvertApp] Converting notebook PROJECT1_TEAM2_CM.ipynb to html
[NbConvertApp] WARNING | Alternative text is missing on 19 image(s).
[NbConvertApp] Writing 4377591 bytes to PROJECT1_TEAM2_CM.html